danq@uunet.uu.net (Daniel Quinlan) (06/05/91)
For several months we've been struggling with high collision rates on a new 10baseT network. I've read previous discussions about 10baseT in comp.protocols.tcp-ip, and read the summary from a previous poster (zjat02@trc.amoco.com (Jon A. Tankersley)). It was all interesting and useful, but hasn't solved our problems. We've observed truly phenomenal collision rates -- as high as 40% -- but so far I have not seen any firm indication that the network traffic is actually being slowed down.

At the beginning of March, we moved our development net from a thick ethernet to a 10baseT (twisted pair) ethernet, as part of a building remodel. Immediately thereafter, we began to see collision rates for individual workstations in the 5-20% range over a 24 hour period. (Here I'm calculating collision rate as the number of collisions divided by the number of output packets, both as reported by netstat -i.) Previously, the collision rate had been 1-2% or less. (See below for a description of the network.)

Because it was the physical layer which changed, suspicion focused on it first. And indeed, many problems were found. The bulk of the wiring was 10baseT spec, but small sections were not, and these were replaced. Several wiring problems (things like transmit + and - being on different pairs) were found and fixed. Pair testers now show the cable in compliance in terms of length and inductance. The cable length is around 250 feet, as measured by resistance.

In order to quickly try out various alternatives, I wrote a shell script which runs netstat, then a command which generates considerable ethernet traffic, and then netstat again. A collision rate is then calculated for that period. Initially, I used a cp command to generate network traffic -- copying a 2 Mbyte file from one nfs mounted filesystem to a second nfs filesystem -- both filesystems residing on jetson, the 4/490 server. Traffic shows a 10-15% ethernet utilization when using cp in this script.
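
A minimal sketch of a script along the lines described above. This is not the original script. It assumes SunOS-style `netstat -i` output in which Opkts is the 7th field and Collis the 9th; check the header line on your own system. The function names `counters`, `collision_rate` and `measure` are made up for illustration.

```shell
#!/bin/sh
# Sketch: collision rate for one interface over the run of a test command.
# Assumed netstat -i columns (SunOS style; verify against your header line):
#   Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue

# Print "opkts collis" for interface $1, reading netstat -i text on stdin.
counters() {
    awk -v ifc="$1" '$1 == ifc { print $7, $9; exit }'
}

# collision_rate OPKTS_BEFORE COLLIS_BEFORE OPKTS_AFTER COLLIS_AFTER
# Prints collisions per output packet over the interval, as a percentage.
collision_rate() {
    awk -v o0="$1" -v c0="$2" -v o1="$3" -v c1="$4" 'BEGIN {
        dp = o1 - o0; dc = c1 - c0
        if (dp <= 0) { print "n/a"; exit }   # no packets sent: rate undefined
        printf "%.1f\n", 100 * dc / dp
    }'
}

# measure IFACE COMMAND...   (hypothetical wrapper around the real netstat)
measure() {
    ifc="$1"; shift
    before=$(netstat -i | counters "$ifc")
    "$@" >/dev/null 2>&1
    after=$(netstat -i | counters "$ifc")
    # $before and $after are deliberately unquoted so each splits into two args
    collision_rate $before $after
}
```

With these definitions, something like `measure ie0 cp /mnt/a/bigfile /mnt/b/bigfile` (interface name and paths are placeholders) would print the collision rate, in percent, for the duration of the copy.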

When run on a sparcstation or slc, this script fairly consistently shows collision rates of 10-40%.

A test was set up with just two machines on the network: an SLC and jetson. Both were in the machine room with short twisted pair drop cables; several different brands of concentrators and transceivers were tried. Surprisingly, we found that with Cabletron transceivers and a Cabletron hub, the collision rate went to zero! The same was true of one other hub, but not of two others. With all the workstations back on the hub, the rate went up to a few percent: acceptable and a vast improvement over the "normal" rates we had been seeing.

So we switched to the Cabletron hub and replaced a few of our transceivers. We plan to change all our transceivers to Cabletron, and permanently switch the hub (currently, we have just borrowed a hub and transceivers from Cabletron).

Unfortunately, the collision rates went back to very high levels once we moved the SLC out of the machine room and back up where it belonged. There may have been a slight decrease in the 24 hour collision rate, but the data is quite noisy and hard to interpret.

This might lead you to suspect the cabling between the hub and the remote stations. However, we took a sparcstation to the machine room and tested the collision rate with a short (~6 ft) drop cable, and with a longer drop cable (between 100 and 200 ft; the cable was still on the spool). The collision rate was 3-4% with the shorter cable, 20% with the longer cable.

The Cabletron representative suggested there might be a problem with the ie0 interface on jetson. Some of the evidence appears to support this; with the same script described above, but running only on a sparcstation with a filesystem nfs mounted from another sparcstation, collision rates are in the 2-3% range.

I also tried using "spray -i -l 1500 -c 3000" instead of cp to generate network traffic.
This allows using workstations which do not have local disks, and allows tuning the packet size. Copies using nfs tend to have a lot of packets of length 1500; using the default packet size for spray doesn't generate many collisions. At around 500 byte packets, large numbers of collisions start to occur. With spray and 1500 byte packets at a time when the net is otherwise very quiet, we can generate collision rates from 10 to 40% running from a sparcstation to jetson, but a maximum of 6% and generally 1-2% among the various sparcstations. Network utilization is in the neighborhood of 40% (as reported by traffic) using spray.

Since the collision rate is low enough between sparcstations, it appears that the physical wiring is acceptable.

We've replaced the transceiver and the cable on jetson, with no appreciable effect. Early on, we switched CPUs, and last night we tried moving the development net from ie0 to ie3. Neither had an effect, so we don't suspect the interface cards.

We've also installed patch #100260-01, which is supposed to fix "misaligned frames from the ie controller during heavy traffic". It had no effect on the collision rate, and we still see misaligned frames from jetson, using a sniffer.

Given the evidence above, it now appears that jetson is somehow the culprit. Every piece of hardware has been replaced or substituted: interface card, transceiver, drop cable. So it appears to be a software problem, or a generic problem with the ie interface. However, it's difficult to see why the problem only shows up on the twisted pair net.

I have a call in to Sun answerline, but at least on the first try, the person there had no suggestions.

Network layout:

The development network has only Sun equipment -- 10 sparcstation 1's, 2 SLC's, 1 4/490, and one 386i. The 386i and one sparcstation boot from local disks; the remainder boot from the 4/490, and one has local disk for /tmp and swap. All but one of the sparc machines are running 4.1.
The odd one out is running 4.1.1, and the 386i is running 4.0.1.

The "production" network has 3 4/470's and 2 4/490's and 2 sparcstations, all hanging from fanout boxes. On this side of the network, the script described above typically shows collision rates of 1% or less.

Overall ethernet utilization is low on both sides of the net, but lower on the production network (perhaps 5-15% average on development, and perhaps 5% on the production side).

Ethernet Layout:

Production network               Development network
                                  ______________
       aui ~10 ft                | concentrator |
fanout------------ 4/490 ------- |______________|
||||||         ie3       ie0       | | | | | |
fanout                             | | | | | |   3 ft
||||||                           distribution panel
3 4/470's                          | | | | | |   10 ft
2 4/490's                          | | | | | |
2 sparcstations                  punchdown block
                                   | | | | | |   3 ft
                                   | | | | | |
                                 punchdown block
                                   | | | | | |
                                   | | | | | |   200-250 ft
                             8 sparcstations |
                             2 slc's         |
                                          fanout
                                            |||   aui cables
                                      2 sparcstations
                                          1 386i

Daniel Quinlan                  {uunet,boulder}!chs!danq
System Administrator            danq%chs@boulder.colorado.edu
Consumer Health Services        303/442-1111 x3124
5720 Flatirons Parkway
Boulder, CO 80301