[comp.dcom.modems] TrailBlazer+/ALM-2 interaction with Ethernet

dplatt@coherent.com (Dave Platt) (05/17/88)

Weekend before last, I came into work on Saturday to help our site's
hardware/net guru ("Madame Server") reconfigure our thin Ethernet.
We'd been having a slow but steady increase in the number of Ethernet
errors...  typically output errors on our Sun 3/280 file server and
input errors on the 3/50 and 3/60 workstations.  The error rate had
increased to somewhere in the 0.1-0.2% range, and was beginning to cause
occasional problems especially during periods of heavy net usage (NFS
glomming, heavy paging by diskless 3/50 workstations, and "dump
workstation file systems to the server's tape drive" sessions).  We had
replaced the server's thinnet transceiver (a Cabletron ST-500) to no
avail.
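
For readers who want to put a number on an error rate like the one quoted above, here is a minimal sketch in Python. It assumes you have two snapshots of an interface's cumulative packet and error counters (for example, copied by hand from "netstat -i" output); the function names and the sample figures are hypothetical and are not taken from the measurements described in this post.

```python
def error_rate_percent(errors, packets):
    """Return errors as a percentage of packets (0.0 if no packets were counted)."""
    if packets <= 0:
        return 0.0
    return 100.0 * errors / packets


def rate_between(before, after):
    """Error rate over an interval, given two (packets, errors) counter snapshots.

    Both snapshots are cumulative counts, so the interval's figures are
    the differences between them.
    """
    packets = after[0] - before[0]
    errors = after[1] - before[1]
    return error_rate_percent(errors, packets)


if __name__ == "__main__":
    # Hypothetical snapshots: (output packets, output errors).
    earlier = (1000000, 500)
    later = (1200000, 800)
    print("%.2f%%" % rate_between(earlier, later))
```

Working from the difference between two snapshots, rather than from the totals since boot, keeps a long quiet period from hiding a recent rise in errors.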

We knew that our thinnet wasn't installed entirely "according to
Hoyle"; the cables snaked through the ceiling in a couple of areas
(getting out of the server room and into the workstation bullpen), and
were hung down along the "spine" of our workstation area next to
120-volt power strips.  We weren't sure whether the primary problem was
interference from fluorescent lights, interference from the
power-strips, bad cables, or what;  Madame Server decided to tear down
the entire workstation-spine portion of the net and rebuild it one
piece at a time, with the cables strung up on the top of the cubicles,
well away from any 120-volt power.

We started by detaching all of the workstations save for one
diskful 3/60, and then loading down the net with a set of "spray" and
"rcp" scripts.  Sun's "traffictool" indicated 20-30% saturation on the
net...  a good deal higher than we see during any but the highest
levels of real-life activity.  Few, if any, errors were to be seen.

At this point, I asked Madame Server if I should power off our Telebit
TrailBlazer modem in order to keep off-site UUCP activity from skewing
our test results or loading down our server's CPU;  she agreed.  I cut
the modem's power, and M.S. said "Whoops... we just got 16 errors in 10
seconds!".  I waited a few seconds, powered the modem back on, and
then off again... and we got another 17 errors!  This proved to be a
highly repeatable phenomenon... powering the modem off generated a
rapid burst of Ethernet output errors whenever the net was under a
substantial load from the server.
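
One way to make a "burst of errors" observation like this less dependent on somebody watching the screen is to log the cumulative error counter at fixed intervals and flag any interval in which it jumps. The following is only a sketch, in Python, under the assumption that the samples have already been collected as (seconds, cumulative error count) pairs; the function name, the threshold, and the sample data are hypothetical.

```python
def find_bursts(samples, threshold):
    """Flag intervals in which the error counter rose by at least `threshold`.

    samples: list of (seconds, cumulative_error_count) pairs in time order.
    Returns a list of (start_seconds, end_seconds, new_errors) tuples.
    """
    bursts = []
    # Compare each sample with the one that follows it.
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        delta = e1 - e0
        if delta >= threshold:
            bursts.append((t0, t1, delta))
    return bursts


if __name__ == "__main__":
    # Hypothetical log, one sample every 10 seconds.
    log = [(0, 3), (10, 3), (20, 19), (30, 19), (40, 36)]
    for start, end, count in find_bursts(log, threshold=10):
        print("%d new errors between t=%ds and t=%ds" % (count, start, end))
```

Noting the time of each power-off alongside such a log would show whether the flagged intervals line up with the modem being switched off.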

As an experiment, I disconnected the modem from its port on our ALM-2
16-line serial board, and plugged it into port A on the server's CPU
board.  I then reconfigured the /etc/ttys file to enable dialin on CPU
port A, kicked /etc/init, and tried the power-on/power-off sequence
again.  Lo and behold, the errors did not recur!

My working hypothesis at this point is that powering off the modem
generates a burst of activity on the serial port; I'm not sure whether
it's a spurt of garbage data on the RD line, or whether DSR or CD is
hopping up and down, or whether it's something else entirely.  Whatever
it is, I believe that it causes the ALM-2 board to begin generating a
swarm of interrupts and/or grabbing the VMEbus for DMA input... and, in
either case, is somehow locking out or interfering with the Ethernet
interface.  The same level of serial-port activity doesn't seem to
cause problems if the modem is attached directly to port A on the CPU
board.

Based on a very limited set of observations, I believe that uucp I/O
between the modem and the ALM-2 board does not in and of itself cause
Ethernet errors under these conditions.  I suspect, however, that the
dropping-and-raising of DTR and/or CD that occurs at the beginning
and/or end of a uucp session may cause such errors to occur.  I could
well be wrong about either or both of these hypotheses... I haven't done
enough experiments to be sure.

At this point, we've decided to leave the TrailBlazer connected to CPU
port A... it eliminates one source of Ethernet errors under some
conditions, and the CPU seems to be able to ingest 19200-baud data from
the modem much more efficiently (~ 30% of the processor when connected
to CPU port A, vs.  ~ 60% when connected to the ALM-2).  We've also
moved our hardwired link to "aimt" to the CPU-board serial ports (port
B) for the same set of reasons.  Our 2400-baud dialup modems will
remain on the ALM-2, as there's no place else to put them.  I haven't
yet experimented to see whether power-cycling these modems also causes
errors under heavy-net-load conditions.

I don't know whether this situation is a design problem in the ALM-2
(excessive VMEbus/interrupt activity), a glitch in the TrailBlazer Plus
(swarms of garbage during power-off... a death-rattle? ;-), a software
problem in the Sun's ALM-2 interface code, or a phase-of-moon problem.
If anybody else out there has seen similar happenings, I'd really like
to hear about it!

[After several hours of reconfiguring and testing our net, we seem to
 have nailed the other sources of packet-errors.  Restringing the
 cables along the tops of the workstation partitions helped, as did
 finding and replacing one cable that had a sloppily-attached
 connector, and another that looked as if it had been attacked by a
 rabid hair stylist armed with a curling iron.  We didn't have to
 restring the portion of the cable that goes through the ceiling and
 past several fluorescent lights... apparently that wasn't the
 problem.  Happiness is a clean network!]
-- 
Dave Platt                                             VOICE: (415) 493-8805
  USNAIL: Coherent Thought Inc.  3350 West Bayshore #205  Palo Alto CA 94303
  UUCP: ...!{ames,sun,uunet}!coherent!dplatt     DOMAIN: dplatt@coherent.com
  INTERNET:   coherent!dplatt@ames.arpa,    ...@sun.com,    ...@uunet.uu.net

tjt@twitch.UUCP ( T.J.Thompson) (05/18/88)

In article <4638@coherent.com>, dplatt@coherent.com (Dave Platt) writes:
> ....
> At this point, we've decided to leave the TrailBlazer connected to CPU
> port A... it eliminates one source of Ethernet errors under some
> conditions, and the CPU seems to be able to ingest 19200-baud data from
> the modem much more efficiently (~ 30% of the processor when connected
> to CPU port A, vs.  ~ 60% when connected to the ALM-2).

Isn't the ALM-2 supposed to offload any of the tty-handling from the main
CPU?  This performance seems poor.   ...Tim Thompson...ihnp4!twitch!tjt...