[comp.protocols.tcp-ip] Ethernet problems induced by bogus ICMP Address Mask Reply

mullen@NRL-CSS.ARPA (Preston Mullen) (08/08/87)

Here's a new one to add to the list of things that can induce broadcast
storms and other serious problems.  It involves an unpleasant interaction
between SunOS 3.3 and 3.4 and Wollongong's WIN/VX software version 3.0.
(Ironically, this problem was noticed when the Suns and VAXes were
upgraded to what is generally considered to be better networking
software.)

When a diskless Sun 3 workstation running SunOS 3.3 or 3.4 boots, at
some point it sends out a broadcast ICMP Address Mask Request.  This is
in accordance with RFC 950; unfortunately, an incorrect reply from any
machine on the network can be accepted by the workstation, and some
incorrect masks can induce the workstation to start sending all packets
as Ethernet broadcasts, which instantly leads to a broadcast storm.

If this happens, the workstation will probably fail to finish
booting completely, usually stopping during the NFS mount with
"RPC: not registered" messages, but sometimes sooner.  Also, the
workstation may itself then generate an incorrect reply (sent as an
Ethernet broadcast) to a subsequent ICMP Address Mask Request from
some other machine, thus spreading the virus.  Other symptoms that
may be observed include extremely sluggish operation of diskless
machines and "no carrier" and "Ethernet jammed" notices from Suns
(aside from those generated during broadcast storms).

This happened here on a non-subnetted Class B network when the 3.0
release of Wollongong's IP/TCP software was installed on two VAXes
running VMS.  Both VAXes replied to a Sun's ICMP Address Mask Request
with network masks of 0000FFFF instead of the correct FFFF0000.  The
diskless Sun gullibly swallowed this and began to encapsulate every IP
packet in an Ethernet broadcast packet.  Even after we isolated our
network from the offending VAXes, any Sun in this state would respond
with the bogus netmask to a broadcast from another Sun and thus
continue the problem.

The solution was to disconnect the offending VAXes from the network,
halt ALL the diskless Suns, then reboot them.  After that, we could let
the VAXes back on the network, but it was not really safe since any
diskless Sun reboot would start the cycle over again.

The people who own the VAXes that started the problem have told me
that the problem is really in the Wollongong software (i.e., not a
configuration error) and is a holdover from some "broken 4.3bsd code".
Evidently there are also problems with setting the address mask.
They report that Wollongong has sent them a fix.

==> Potential users of Wollongong 3.0 software should get a fix from
Wollongong before running 3.0 on a network where a machine might
broadcast an ICMP address mask request.

==> Sun should make their networking code smarter.  There is no way a
netmask of 0000FFFF could ever be valid, since it must include the
normal network part of the address as well as the subnet part.  The
Suns should have ignored the faulty replies instead of being driven
berserk by them.  I've reported this to Sun.  The problem doesn't
affect Suns running 3.2 or earlier releases since those releases have
no subnetting support.
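
The sanity check suggested above can be sketched roughly as follows.
This is only an illustration in Python, not Sun's code: the function
names and the example address 128.1.2.3 are made up for the sketch, and
it assumes the classful (Class A/B/C) addressing in use on this network.
A mask is treated as plausible only if it covers every bit of the
natural network mask for the address.

```python
import ipaddress


def natural_mask(addr: int) -> int:
    """Return the classful (pre-subnet) network mask for a 32-bit address."""
    first_octet = (addr >> 24) & 0xFF
    if first_octet < 128:        # Class A
        return 0xFF000000
    if first_octet < 192:        # Class B
        return 0xFFFF0000
    if first_octet < 224:        # Class C
        return 0xFFFFFF00
    raise ValueError("Class D/E addresses have no network mask")


def mask_is_plausible(addr: int, mask: int) -> bool:
    """A subnet mask must include all bits of the natural network mask."""
    nat = natural_mask(addr)
    return (mask & nat) == nat


# Hypothetical Class B host address, chosen only for illustration.
host = int(ipaddress.IPv4Address("128.1.2.3"))

for candidate in (0xFFFF0000, 0xFFFFFF00, 0x0000FFFF):
    verdict = "accept" if mask_is_plausible(host, candidate) else "ignore"
    print(f"{candidate:08X}: {verdict}")
```

With a check like this, a reply carrying 0000FFFF would be ignored for
any Class B address, while FFFF0000 and a subnetted FFFFFF00 would pass.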

It would probably be better if the diskless Sun did not broadcast the
Address Mask Request but instead sent it directly to the server.
I think that the request is broadcast after the ifconfig of the
diskless Sun's Ethernet interface (anyway, an "address mask set to
FFFF0000" report appears on the console quite a while before the
problem shows itself).  If that is so, then it's not clear why an ICMP
Address Mask Request needs to be sent at all, since the network mask
can be specified in the ifconfig in the rc.boot file.

By the way, our class B network is partitioned by level 2 bridges
(Digital DEBETs).  Needless to say, they passed every bad packet and
broadcast right through.  (The VAXes were on the other side of a bridge
from my Suns.)  Yep, I'll be moving my stuff behind an IP gateway now.

Many, many thanks to Van Jacobson for 'tcpdump', which proved
instrumental in tracking this down.  I hope Sun will let Van
release the source code for tcpdump.

	Preston Mullen
	Computer Science and Systems Branch
	Information Technology Division
	Naval Research Laboratory
	Washington DC 20375-5000

P.S.    Why do diskless Sun workstations running SunOS 3.4 broadcast an
	ARP request for IP address 0.0.60.216 very early in the boot
	sequence?  (The ARP packet asks that replies go to 0.0.60.216.)
	This appears to be wired into the Sun networking software.
	It's harmless enough, but it should not be there.  Maybe
	someone forgot to take out some debugging code.

hedrick@TOPAZ.RUTGERS.EDU (Charles Hedrick) (08/09/87)

I believe your ARP request for 0.0.60.216 is used during the boot
sequence, at a point when the Suns don't know their Internet address.
They seem to use their serial number as an address.  You'll also find
such addressing in the routing tables in certain versions of SunOS.
I suggest that you take 60 * 256 + 216, and see if you don't have
a Sun with that serial number.
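
For anyone who wants to check other addresses the same way, here is a
small Python sketch of the arithmetic.  It just reads the dotted quad
as a 32-bit integer; whether that integer really is the serial number
is the guess made above, not something this code establishes, and the
helper name is made up.

```python
import ipaddress


def dotted_quad_to_int(address: str) -> int:
    """Interpret a dotted-quad IP address as one 32-bit integer."""
    return int(ipaddress.IPv4Address(address))


# 0.0.60.216 has zero in the two high octets, so only 60 and 216 count.
candidate_serial = dotted_quad_to_int("0.0.60.216")
print(candidate_serial)          # same value as 60 * 256 + 216
```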

timk@NEWTON.NCSA.UIUC.EDU (Tim Krauskopf) (08/10/87)

We encountered the exact same problem (diagnosed with tcpdump) between our
diskless Sun 3/50 (SunOS 3.3) and a VAXstation (TWG under VMS).  The ICMP
address mask reply from TWG is backwards.
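
To make "backwards" concrete: for a Class B network the bogus value
0000FFFF reported in this thread is both the byte-reversed form and the
bitwise complement of the correct FFFF0000.  The Python sketch below
only demonstrates that relationship.  Nothing in this thread says which
of the two (if either) is the actual cause in the TWG code, and the
helper names are made up for the sketch.

```python
import struct

CORRECT_MASK = 0xFFFF0000   # correct Class B mask from the thread
BOGUS_MASK = 0x0000FFFF     # mask the VAXes actually sent


def byte_swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def complement32(value: int) -> int:
    """Bitwise complement, truncated to 32 bits."""
    return ~value & 0xFFFFFFFF


print(f"{byte_swap32(CORRECT_MASK):08X}")
print(f"{complement32(CORRECT_MASK):08X}")
print(byte_swap32(CORRECT_MASK) == BOGUS_MASK,
      complement32(CORRECT_MASK) == BOGUS_MASK)
```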

Easy work-around while waiting for bug fixes from TWG:
  Install the netmask parameter in the rc.boot files (ifconfig lines) of as
  many diskless Suns as you feel like.
  Each of the Suns with the mask installed will operate correctly.
  For the rest of the machines, the Russian roulette has much better odds:
  When the ICMP request goes out, there are several Suns waiting to respond
  and they are faster at responding than the VAXen.  Whoever gets there first
  wins.

Isn't this fun?

Tim Krauskopf
National Center for Supercomputing Applications
University of Illinois
timk%newton@uxc.cso.uiuc.edu

jerry@twg.arpa ("Jerry Scott") (08/14/87)

On 8 Aug 87, Preston Mullen <mullen@nrl-css.arpa> wrote:

>>==> Potential users of Wollongong 3.0 software should get a fix from
>>Wollongong before running 3.0 on a network where a machine might
>>broadcast an ICMP address mask request.

A fix may be obtained for this problem by using anonymous ftp to
host twg.arpa (26.5.0.73) and obtaining image ip_icmp.o.  The file
is in binary/record format and should be transferred from a Wollongong
host running either release 2.3 or 3.0.

The image named472.sav can also be obtained at this time.  This is
the new version of the BIND name server from D. Kingston, et al.
The image is in VMS Backup format and should be obtained using
binary/record mode from a Wollongong host.

Regards,

Jerry Scott