dplatt@coherent.com (Dave Platt) (05/17/88)
Weekend before last, I came into work on Saturday to help our site's hardware/net guru ("Madame Server") reconfigure our thin Ethernet. We'd been having a slow but steady increase in the number of Ethernet errors... typically output errors on our Sun 3/280 file server and input errors on the 3/50 and 3/60 workstations. The error rate had increased to somewhere in the 0.1-0.2% range, and was beginning to cause occasional problems, especially during periods of heavy net usage (NFS glomming, heavy paging by diskless 3/50 workstations, and "dump workstation file systems to the server's tape drive" sessions). We had replaced the server's thinnet transceiver (a Cabletron ST-500) to no avail.

We knew that our thinnet wasn't installed entirely "according to Hoyle"; the cables snaked through the ceiling in a couple of areas (getting out of the server room and into the workstation bullpen), and were hung down along the "spine" of our workstation area next to 120-volt power strips. We weren't sure whether the primary problem was interference from fluorescent lights, interference from the power strips, bad cables, or what; Madame Server decided to tear down the entire workstation-spine portion of the net and rebuild it one piece at a time, with the cables strung up on the top of the cubicles, well away from any 120-volt power.

We started by detaching all of the workstations save for one diskful 3/60, and then loading down the net with a set of "spray" and "rcp" scripts. Sun's "traffictool" indicated 20-30% saturation on the net... a good deal higher than we see during any but the highest levels of real-life activity. Few, if any, errors were to be seen.

At this point, I asked Madame Server if I should power off our Telebit TrailBlazer modem in order to keep off-site UUCP activity from skewing our test results or loading down our server's CPU; she agreed. I cut the modem's power, and M.S. said "Whoops... we just got 16 errors in 10 seconds!"
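[Editorial sketch, not part of the original post.] The two measurements quoted so far, a steady error rate of a fraction of a percent and a sudden burst of errors within a few seconds, can both be derived from periodic snapshots of an interface's packet and error counters. The Python below is only a rough illustration under stated assumptions: the `Snapshot` layout is hypothetical (its fields are modeled on the Ipkts/Ierrs/Opkts/Oerrs columns that `netstat -i` prints), the 0.5 errors-per-second threshold is arbitrary, and the sample numbers are invented for the example rather than taken from the poster's network.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Interface counters at one moment (hypothetical layout, modeled on netstat -i columns)."""
    time_s: float
    ipkts: int
    ierrs: int
    opkts: int
    oerrs: int


def error_rate_percent(errors: int, packets: int) -> float:
    """Errors as a percentage of packets; 0.0 when no packets were counted."""
    return 100.0 * errors / packets if packets else 0.0


def find_bursts(samples, max_errors_per_s=0.5):
    """Return (start, end, new_errors) for each interval whose error rate exceeds the threshold.

    The threshold is an arbitrary assumption; tune it to the background rate of the net.
    """
    bursts = []
    for prev, cur in zip(samples, samples[1:]):
        elapsed = cur.time_s - prev.time_s
        new_errors = (cur.ierrs - prev.ierrs) + (cur.oerrs - prev.oerrs)
        if elapsed > 0 and new_errors / elapsed > max_errors_per_s:
            bursts.append((prev.time_s, cur.time_s, new_errors))
    return bursts


# Invented sample data: one stray output error, then a jump of 16 in 10 seconds.
samples = [
    Snapshot(time_s=0.0, ipkts=50_000, ierrs=0, opkts=100_000, oerrs=150),
    Snapshot(time_s=10.0, ipkts=56_000, ierrs=0, opkts=112_000, oerrs=151),
    Snapshot(time_s=20.0, ipkts=62_000, ierrs=0, opkts=124_000, oerrs=167),
]

print(error_rate_percent(samples[0].oerrs, samples[0].opkts))
print(find_bursts(samples))
```

Working from counter deltas rather than running totals is what makes a short burst stand out: 16 extra errors barely move a cumulative percentage, but they are obvious as a per-interval count.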
I waited a few seconds, powered the modem back on, and then off again... and we got another 17 errors! This proved to be a highly repeatable phenomenon... powering the modem off generated a rapid burst of Ethernet output errors whenever the net was under a substantial load from the server.

As an experiment, I disconnected the modem from its port on our ALM-2 16-line serial board, and plugged it into port A on the server's CPU board. I then reconfigured the /etc/ttys file to enable dialin on CPU port A, kicked /etc/init, and tried the power-on/power-off sequence again. Lo and behold, the errors did not recur!

My working hypothesis at this point is that powering off the modem generates a burst of activity on the serial port; I'm not sure whether it's a spurt of garbage data on the RD line, or whether DSR or CD is hopping up and down, or whether it's something else entirely. Whatever it is, I believe that it causes the ALM-2 board to begin generating a swarm of interrupts and/or grabbing the VMEbus for DMA input... and, in either case, is somehow locking out or interfering with the Ethernet interface. The same level of serial-port activity doesn't seem to cause problems if the modem is attached directly to port A on the CPU board.

Based on a very limited set of observations, I believe that uucp I/O between the modem and the ALM-2 board does not in and of itself cause Ethernet errors under these conditions. I suspect, however, that the dropping-and-raising of DTR and/or CD that occurs at the beginning and/or end of a uucp session may cause such errors to occur. I could well be wrong about either or both of these hypotheses... I haven't done enough experiments to be sure.

At this point, we've decided to leave the TrailBlazer connected to CPU port A... it eliminates one source of Ethernet errors under some conditions, and the CPU seems to be able to ingest 19200-baud data from the modem much more efficiently (~ 30% of the processor when connected to CPU port A, vs.
~ 60% when connected to the ALM-2). We've also moved our hardwired link to "aimt" to the CPU-board serial ports (port B) for the same set of reasons. Our 2400-baud dialup modems will remain on the ALM-2, as there's no place else to put them. I haven't yet experimented to see whether power-cycling these modems also causes errors under heavy-net-load conditions.

I don't know whether this situation is a design problem in the ALM-2 (excessive VMEbus/interrupt activity), a glitch in the TrailBlazer Plus (swarms of garbage during power-off... a death-rattle? ;-), a software problem in the Sun's ALM-2 interface code, or a phase-of-moon problem. If anybody else out there has seen similar happenings, I'd really like to hear about it!

[After several hours of reconfiguring and testing our net, we seem to have nailed the other sources of packet errors. Restringing the cables along the tops of the workstation partitions helped, as did finding and replacing one cable that had a sloppily attached connector, and another that looked as if it had been attacked by a rabid hair stylist armed with a curling iron. We didn't have to restring the portion of the cable that goes through the ceiling and past several fluorescent lights... apparently that wasn't the problem. Happiness is a clean network!]

--
Dave Platt                                VOICE: (415) 493-8805
USNAIL: Coherent Thought Inc.  3350 West Bayshore #205  Palo Alto CA 94303
UUCP: ...!{ames,sun,uunet}!coherent!dplatt     DOMAIN: dplatt@coherent.com
INTERNET: coherent!dplatt@ames.arpa, ...@sun.com, ...@uunet.uu.net
tjt@twitch.UUCP ( T.J.Thompson) (05/18/88)
In article <4638@coherent.com>, dplatt@coherent.com (Dave Platt) writes:
> ....
> At this point, we've decided to leave the TrailBlazer connected to CPU
> port A... it eliminates one source of Ethernet errors under some
> conditions, and the CPU seems to be able to ingest 19200-baud data from
> the modem much more efficiently (~ 30% of the processor when connected
> to CPU port A, vs. ~ 60% when connected to the ALM-2).

Isn't the ALM-2 supposed to offload any of the tty-handling from the main CPU? This performance seems poor.

...Tim Thompson...ihnp4!twitch!tjt...