[comp.protocols.tcp-ip] Ethernet meltdowns

hedrick@TOPAZ.RUTGERS.EDU (Charles Hedrick) (07/08/87)

During the last week or so we have run into several oddities on our
Ethernets that I thought might interest this group.  Nothing that will
surprise any veterans, but sometimes war stories are useful to people
trying to figure out what is going on with their own net.

For several months, we have been having mysterious software problems
on one Ethernet.  This is our "miscellaneous" network.  No diskless
Suns.  Several Unix timesharing systems, a few VMS machines, a DEC-20,
and some Xerox Interlisp-D machines.  The problems:
  - every week or so, all of our Bridge terminal servers crashed.
	When it happened, they all crashed at the same time.
  - fairly rarely, a Celerity Unix system would run out of mbufs.
  - a Kinetics Ethernet/Appletalk gateway running the kip 
	code would hang or crash (not sure which) every few days.

We sent a dump of the Bridge crash to Bridge.  Celerity wouldn't talk
to us because we made a few changes to the kernel.  Kinetics swapped
hardware for us, so we knew it wasn't hardware, but we still haven't
figured out how to debug the problem.  (The author of the software
suspects the Ethernet device driver, but it's going to take us months
to learn enough about the infamous Intel Ethernet chip to find a
subtle device-level problem.  Typical known problem: packet sizes that
are a multiple of 18 bytes hang the hardware when the phase of the
moon is wrong.  How's a bunch of poor Unix hackers gonna debug a
system where the critical chip has a 1/4-inch-thick bug list, which we
don't have a copy of?)  Anyway, Bridge finally came back with a
response that unfortunately I have only second-hand: "We got a very
high rate of packets from two different Ethernet addresses each
claiming to be the same Internet address.  This shouldn't cause us
problems, but does.  We found the problem, and it will be fixed in the
next release."  They gave us the two Ethernet addresses and the
Internet address.  Two Celerities were claiming to be some other
machine.  So we break out our trusty copy of etherfind.  (This is a
Sun utility that lets you look at packets.  There's a fairly general
way of specifying which ones you want to see, and it will decode the
source, destination, and protocol type for IP.  We've got lots of
Ethernet debugging tools, but this is by far the most useful for this
kind of problem.)  It turns out that the Celerities have the infamous
bug that causes them to get the addresses wrong in ICMP error
messages.  Before proceeding with the war story, let me list the 
classic 4.2 bugs that lead to network problems:

1) Somebody sends to a broadcast address that you don't understand.
There are 6 possible broadcast addresses.  For a subnetted network
128.6.4, they are 255.255.255.255 and 128.6.4.255 (the correct ones by
current standards), 128.6.255.255 (for machines that don't know about
subnetting), and the corresponding ones for machines that use the old
standards: 0.0.0.0, 128.6.4.0, and 128.6.0.0.  We have enough of a
combination of software versions that there is no one broadcast
address that all of our machines understand.  So suppose somebody
sends to 128.6.4.255.  Our 4.2 machines, which expect 0.0.0.0 or
128.6.0.0, see this as an attempt to connect to host 255 on the local
subnet.  Since IP forwarding is on by default, they helpfully decide
to forward it.  Thus they issue ARP requests for the address
128.6.4.255.  Presumably nobody responds.  So the net effect is that
each broadcast results in every 4.2 machine on the Ethernet issuing an
ARP request, all at the same time.  This causes massive collisions,
and also every machine has to look at all those ARP requests and
throw them away.  This will tend to cause a momentary pause in normal
processing.
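
To make the failure mode concrete, here is a toy illustration of the
mismatch.  This is not the 4.2 kernel code; the helper name and the
hard-wired net 128.6 are invented for the sketch:

    /*
     * Toy illustration, not 4.2 source.  A host that accepts only the
     * old-style broadcast addresses (0.0.0.0 and 128.6.0.0) treats the
     * current-standard ones as ordinary host addresses, decides to
     * forward them, and ARPs for them.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    #define NET_128_6 0x80060000u               /* 128.6.0.0 */

    static int old_style_broadcast(uint32_t dst)
    {
        return dst == 0 || dst == NET_128_6;
    }

    int main(void)
    {
        const char *probes[] = { "255.255.255.255", "128.6.4.255",
                                 "128.6.255.255", "0.0.0.0",
                                 "128.6.4.0", "128.6.0.0" };
        int i;

        for (i = 0; i < 6; i++) {
            uint32_t dst = ntohl(inet_addr(probes[i]));
            printf("%-16s %s\n", probes[i],
                   old_style_broadcast(dst)
                       ? "recognized as broadcast"
                       : "looks like a host: forward it, ARP for it");
        }
        return 0;
    }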

2) Same scenario, but somebody has turned off ipforwarding on all the
4.2 machines.  Alas, this simply causes all the 4.2 machines to issue
ICMP unreachable messages back to the original sender.  This still
results in massive collisions, but at least this time only one machine
(the one that sent the broadcast) has to process the fallout.  That's
if everything works.  Unfortunately, some 4.2 versions have an error
in setting up the headers for the error message.  They forget to
reverse the source and destination, as I recall.
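
The correct construction is trivial, which makes the bug all the more
annoying.  Here is a sketch, with invented names and addresses, of what
the error's addresses are supposed to look like; the broken versions
copy the original source and destination straight through:

    /*
     * Sketch with invented names -- not anybody's actual icmp_error().
     * The error is supposed to travel from one of our addresses back
     * to the sender of the offending packet; the broken 4.2 variants
     * copy the original addresses straight across instead.
     */
    #include <stdio.h>
    #include <arpa/inet.h>

    struct addrs { struct in_addr src, dst; };

    static struct addrs icmp_error_addrs(struct addrs orig, struct in_addr me)
    {
        struct addrs e;
        e.src = me;           /* the error comes from us...           */
        e.dst = orig.src;     /* ...and goes back to whoever sent it. */
        return e;
    }

    int main(void)
    {
        struct addrs orig, e;
        struct in_addr me;

        inet_aton("128.6.4.2", &orig.src);    /* the broadcaster       */
        inet_aton("128.6.4.255", &orig.dst);  /* the broadcast address */
        inet_aton("128.6.4.94", &me);         /* this (made-up) host   */

        e = icmp_error_addrs(orig, me);
        printf("ICMP error src %s", inet_ntoa(e.src));
        printf(" -> dst %s\n", inet_ntoa(e.dst));
        return 0;
    }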

3) Somebody sends a broadcast UDP packet, e.g. routed routing
information.  Hosts that are not running routed (or whatever) attempt
to send back ICMP port unreachable.  They are supposed to avoid
doing this for broadcasts, but the test for broadcastedness in udp_usrreq 
doesn't agree with the one in ip_input, so for certain broadcast
addresses, every machine on the network that isn't running the
appropriate daemon will send back an ICMP error.  Again, lots of
collisions.  If you have a few gateways running routed, but most
hosts not running it, you'll have network interference every 30
sec.  Then again, there are those machines where the ICMP messages
have the wrong source and destination address.

Now back to the war story.  The case I actually saw with etherfind was
caused by routed broadcasts.  Our 2 Celerities would each respond with
ICMP port unreachable.  Unfortunately, they have the bug that caused
the IP addresses in the ICMP error message to be wrong.  I think it
ended up sending packets with source address == the machine that had
sent the routed broadcast, and destination == the broadcast address.  This
would explain why our Bridge terminal servers were seeing packets from
two different Ethernet addresses, both claiming to be a different
machine.  We had certainly been seeing spotty network response, and as
far as I can see, it went away when we fixed these problems.  As far
as we know, the Bridge terminal servers and Kinetics gateways have
both stopped crashing, and the Celerities have stopped losing mbuf's.
What we suspect is that some obscure case came up that created a
problem more serious than the one we saw with etherfind.  Note that
one of the failure modes is that certain broadcasts can lead to error
messages sent to the broadcast address.  We haven't analysed the code
carefully enough to be sure exactly what conditions trigger it, but we
suspect that the two machines may have gotten into an infinite loop of
error messages.  Since the messages would be broadcasts, everyone on
the network would see them.  This is generally called a "broadcast
storm".  The best guess is that both the Bridge and Kinetics crashes
were caused by subtle bugs in their low-level code that fail under
very heavy broadcast loads.  Probably the Celerity "mbuf leak" is
something similar.  Unfortunately, without a record of the packets on
the network at the exact time of failure, it is impossible to be sure
what was going on.  But Bridge's crash analysis seems to indicate a
broadcast storm involving the Celerities.

The fix to this is to make sure every one of your 4.2 systems has
been made safe in the following fashion:

 - turn off ipforwarding, in ip_input

 - in the routine ip_forward (in ip_input), very near the beginning
	of the routine, there is a test with lots of conditions,
	that ends up throwing away the packet and exiting.  Add
	"! ipforwarding || " to the beginning of the test.

 - in udp_usrreq, a few pages into the routine, in_pcblookup is
	called to see whether there is a port for the UDP packet.
	If not (it returns NULL), normally icmp_error is called
	to send port unreachable.  However there is a test to
	see whether the packet was sent as a broadcast.  If so,
	it is simply discarded.  That test must agree with the
	test for broadcastedness in ip_input.  This seems to
	differ in various implementations, so I can't tell you
	the code to use.  One common bug is to forget that
	ip_input recognizes 255.255.255.255 as a broadcast
	address.  It normally does this in a completely different
	place than it tests for other broadcast addresses.
	So you may be able to add something like
	"ui->ui_dst.s_addr == -1 || " to the test in udp_usrreq.

These apply to 4.2.  4.3 probably doesn't need them all, and may not
need any of them.
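
To show the intent of all three changes in one place, here is a
user-level sketch of the decision the patched kernel should be making.
It is not the 4.2 source; the names, the hard-wired net numbers, and
the structure are invented, and your ip_input/ip_forward/udp_usrreq
will look different:

    /*
     * User-level sketch of the policy the fixes above try to enforce;
     * it is not kernel source.  The point: never forward a broadcast,
     * never forward anything with forwarding disabled, and never
     * answer a broadcast with an ICMP error -- and make sure the same
     * broadcast test is used everywhere, including the all-ones (-1)
     * address.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    #define NET      0x80060000u    /* 128.6.0.0   */
    #define SUBNET   0x80060400u    /* 128.6.4.0   */
    #define SUBMASK  0xffffff00u    /* subnet mask */

    static int ipforwarding = 0;    /* fix #1: off */

    /* One broadcast test, shared by the IP and UDP paths. */
    static int is_broadcast(uint32_t dst)
    {
        return dst == 0xffffffffu || dst == 0 ||
               dst == (SUBNET | ~SUBMASK) || dst == SUBNET ||
               dst == (NET | 0x0000ffffu) || dst == NET;
    }

    /* What to do with a packet that is not addressed to us. */
    static const char *not_for_us(uint32_t dst)
    {
        if (!ipforwarding || is_broadcast(dst))
            return "drop silently";           /* fixes #1 and #2 */
        return "forward";
    }

    /* What to do with a UDP packet for a port nobody has bound. */
    static const char *no_such_port(uint32_t dst)
    {
        if (is_broadcast(dst))
            return "drop silently";           /* fix #3 */
        return "send ICMP port unreachable";
    }

    int main(void)
    {
        uint32_t b = ntohl(inet_addr("128.6.4.255"));
        uint32_t h = ntohl(inet_addr("128.6.4.94"));

        printf("misunderstood broadcast: %s\n", not_for_us(b));
        printf("routed broadcast, no routed running: %s\n", no_such_port(b));
        printf("ordinary packet, no listener: %s\n", no_such_port(h));
        return 0;
    }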

Now for the second war story.  Our computer center recently bought a
few diskless Suns for staff use.  Until then, all diskless Suns had
been on separate Ethernets separated from our other Ethernets by
carefully-designed IP gateways.  However the computer center figured
that a small number of these things wasn't going to kill their
network, so they connected them to their main Ethernet.  On it is a
VAXcluster (2 8650's), a few 780's, some terminal servers and other
random stuff, and level 2 bridges (Applitek broadband Ethernet bridges
and Ungerman-Bass remote bridges) to more or less everywhere else on
campus.  Since they were still setting up the configuration, it isn't
surprising that a diskless Sun 3/50 got turned on before its server
was properly configured to respond.  Nobody thought anything of this.
We first discovered there were problems when we got a call from
somebody in a building half a mile away that his VAX was suddenly not
doing any useful work.  Then we got a call from our branch in Newark
saying the same thing about their VAXes.  Then someone noticed that
the cluster was suddenly very slow.  Well, it turns out that the Suns
were sitting there sending out requests for their server to boot them.
These were broadcast TFTP requests.  Unfortunately, they used a new
broadcast address, which the Wollongong VMS code doesn't understand.
So VMS attempted to forward them.  This means that it issued an ARP request
for the broadcast address.  There is some problem in the Wollongong
TCP that we don't quite understand yet.  It seems that whenever there
are lots of requests to talk to a host that doesn't respond to ARP's,
the whole CPU ends up being used up in issuing ARP's.  For example,
when something goes wrong with our IBM-compatible mainframe (which is
used to handle most of the printer output for the cluster, using Unix
lpd implementations on both systems) the VAX cluster becomes unusable.
As far as we can tell, it is spending all of its time trying to ARP
the mainframe.  In this case, the same phenomenon was triggered
by the attempt to forward broadcast packets.  Since our VMS systems
mostly sit on networks that are connected by level 2 bridges instead
of real IP gateways, broadcasts go throughout the whole campus, and
essentially every VMS system is brought to its knees.  Unfortunately,
there is no way we can fix this.  The Sun broadcast is being issued
by its boot ROM, which is the one piece of software we aren't equipped
to change, and we don't have source to the Wollongong code.  So the
solution for the moment is to put the Suns on a subnet that is
safely isolated behind an IP gateway.  This fixes the problem, because
IP gateways don't pass broadcasts, or they only pass very carefully
selected ones.

CERF@A.ISI.EDU (07/13/87)

Charles,

In case I haven't said it earlier, I just want you to know that your
concrete contributions to the Internet technology and practical comments
on various pieces of software and hardware have been extremely helpful.

Keep up the good work (fight the good fight?).

Vint Cerf

LYNCH@A.ISI.EDU (Dan Lynch) (07/14/87)

Charles,  I just had a chance to digest your "meltdown" note.  I have
seen no response on the net suggesting any cures for your ailments.
Nor do I expect any...  You describe the "state of the art".  

And they say there is little need for testing of TCP/IP!!!

I am extremely curious whether the test planning that is going on
at COS for the ISO stack(s) is looking at the kind of grief
that you are currently experiencing.  On the other hand, don't
most of your problems come from multiple interpretations of
the "broadcast" address?

Dan
-------

hedrick@TOPAZ.RUTGERS.EDU (Charles Hedrick) (07/14/87)

Well, I know the cure for broadcast storms, and I think plenty of
other people do as well.  I mostly gave it in the message.  You simply
have to be very careful to do validity checking before forwarding
packets or generating ICMP error messages.  As far as I can tell, 4.3
is fairly good, so it's mostly a matter of waiting for vendors to
catch up to 4.3.  All the Unix vendors we deal with have either just
released 4.3-based network code or are about to do so.  I agree with
your implication that validation of TCP/IP implementations would be
useful.  I understand that it is hard to design a test setup that will
make sure a TCP follows all the best performance guidelines.  But it
is not at all hard to make sure that an IP is designed so it won't
contribute to broadcast storms.  

My first inclination is to say that it will be easy for ISO to avoid
this problem.  It isn't hard to come up with a set of implementation
guidelines that avoid broadcast storms.  What really triggered this
was the Internet changing its idea of the broadcast address.  I mean,
it shouldn't have been hard to foresee this problem when 0 was changed
to -1.  (On the other hand, subnetting probably required enough of a
change that things would have broken anyway, so there might have been no
way to avoid problems.)

However this may be giving too much credit to people.  The people who
will be implementing ISO are exactly the same people who have ignored
the TCP/IP implementation guidelines.  If people can do IP's that
don't respond to ICMP echo, presumably they can find ways to mess up
ISO as well.  It seems to me that ISO's equivalent of the broadcast
address change is going to be the incredibly complex address
structure.  It seems likely that few people will implement every
possible address format.  (Indeed probably they couldn't if they
wanted to.)  My intuition says that when different implementations
implement different sets of address formats, there are bound to be
some interesting interactions somewhere.  And with the worldwide PTT
network built into the addressing structure, I'll bet at some point
we'll manage to see some sort of storm that involves several
continents.

dupuy@amsterdam.columbia.edu (Alexander Dupuy) (07/15/87)

We once had a similar problem with a broadcast storm started by a diskless
Sun-3 trying to boot without a server.  Although you are correct when you say
that the boot broadcast address is hardwired in the Sun-3 PROMs, there is a
workaround, at least if you aren't on a class A or B network with subnets
(which is the case here at Columbia, and probably at Rutgers, *sigh*).

When a Sun 3 (diskless or otherwise) tries to boot, it looks in the EEPROM on
the CPU board for a default boot device.  If none is found, it takes the first
bootable device it finds, in the order it looks for them: disk, tape, ethernet.

A device spec looks something like this: ty(#,#,#), where ty is the board type
and the three numbers are the controller #, unit #, and partition #.  The
defaults for these are all 0.  In the case of an ethernet device, the unit # is
actually the last component of an internet host number, with 0 signifying the
broadcast address (which is all ones, not zeros).

When a fresh-from-the-factory diskless Sun-3 boots, the PROM, not finding
anything better than an ethernet device to boot from, starts a TFTP boot from
the device ie(0,0,0) or le(0,0,0), which can result in network meltdown if no
server responds (and sometimes even when one does).

However, if the server is at address 128.59.0.110 (say) you can set the default
boot device to be ie(0,110,0), and the only broadcasts which the booting sun
will generate will be the initial RARP and ARP requests that can be answered by
any machine, not just the server.

The catch in this is that if the server is at address 128.59.16.110, the host
part of the address (by the pre-subnetting rules, anyhow) is the number 4206,
and the largest possible unit number is 255.  One hopes that Sun will someday
support subnets in the boot PROM, so that this is no longer a problem; in the
meantime, one might consider using subnet 0 (if that's legal) for Sun diskless
clients and servers.
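
The arithmetic is easy to check.  A few throwaway lines of C (nothing
Sun-specific in them, and the addresses are just the examples above):

    /*
     * Quick check: can a server's address be encoded in the 8-bit unit
     * number of the boot device spec?  By the pre-subnetting rules the
     * host part of a class B address is the low 16 bits, so anything
     * outside subnet 0 won't fit.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    static void check(const char *server)
    {
        uint32_t a = ntohl(inet_addr(server));
        uint32_t host = a & 0x0000ffffu;        /* class B host part */

        if (host <= 255)
            printf("%-15s host part %4u: use ie(0,%u,0)\n",
                   server, host, host);
        else
            printf("%-15s host part %4u: too big for the unit field\n",
                   server, host);
    }

    int main(void)
    {
        check("128.59.0.110");     /* fits: ie(0,110,0)                 */
        check("128.59.16.110");    /* 16*256 + 110 = 4206, does not fit */
        return 0;
    }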

@alex
---
arpanet: dupuy@columbia.edu
uucp:	...!seismo!columbia!dupuy

karn@FALINE.BELLCORE.COM (Phil R. Karn) (07/16/87)

As I've said before, I think the notion of an "IP broadcast address" is
utterly meaningless. Broadcasting is a notion limited solely to certain
subnets; the Internet itself has no notion of broadcasting. (I'll ignore
the experimental multicast work for the time being).

Therefore it is completely bogus to look at anything other than the
SUBNET destination address when determining whether an incoming packet
is a broadcast or not. Getting an Ethernet packet with all 1's in the
destination field is both necessary and sufficient to label it as a
broadcast packet that must not be forwarded or answered with an ICMP
message even if the type field says it's an IP datagram. The IP address
field should be completely ignored; therefore it is irrelevant to even
specify a "standard" IP broadcast address.
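
In code, the rule I'm arguing for is a single test on information the
driver already has.  A toy sketch, not any particular kernel's input
path:

    /*
     * Toy sketch of the rule, not any particular kernel's code.  The
     * driver knows whether the frame arrived with the all-ones
     * Ethernet destination; that single bit decides whether the packet
     * may be forwarded or answered with an ICMP error.  The IP
     * destination address is never consulted.
     */
    #include <stdio.h>

    struct frame {
        int link_level_broadcast;   /* dest was ff:ff:ff:ff:ff:ff */
        const char *desc;
    };

    static const char *disposition(const struct frame *f)
    {
        if (f->link_level_broadcast)
            return "local delivery only: never forward, never ICMP";
        return "normal unicast handling";
    }

    int main(void)
    {
        struct frame frames[] = {
            { 1, "routed update to ff:ff:ff:ff:ff:ff" },
            { 0, "ordinary packet to our own Ethernet address" },
        };
        int i;

        for (i = 0; i < 2; i++)
            printf("%-45s -> %s\n", frames[i].desc,
                   disposition(&frames[i]));
        return 0;
    }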

It is arguable whether broadcast packets should even use UDP/IP at all,
although I suppose it is handy because of the port multiplexing and
checksumming provided by UDP.

Phil

gross@GATEWAY.MITRE.ORG (Phill Gross) (07/16/87)

> The people who
> will be implementing ISO are exactly the same people who have ignored
> the TCP/IP implementation guidelines.  If people can do IP's that
> don't respond to ICMP echo ... 

At least in theory, this is not the way it is supposed to work.  A principal 
reason for eventually converting to ISO protocols is that they will be 
off-the-shelf, conforming products freely available from all vendors, 
where `conformance' is meant to imply both adherence to the standard 
and interoperation between vendor implementations.

braden@BRADEN.ISI.EDU (Bob Braden) (07/17/87)

Phil,

Your interpretation of the IP broadcast address situation is at variance
with the "official" interpretation in RFC-1009.  The issue is one on
which reasonable and informed Internetters can and have disagreed,
on this mailing list and elsewhere, and there probably is not a "right"
answer.  However, we did make a decision, and gave the Internet community
plenty of chance to read and comment on RFC-1009 before it was
published. As a matter of fact, the IP broadcast address was not
an issue on which anybody made a comment (and there WERE plenty of
comments and commenters on the RFC-985+ draft!).

I suspect that you will now go read RFC-1009, and be suitably outraged.
All outraged messages to Jon Postel or myself on the contents of RFC-1009
will be read, considered carefully, and if the arguments are irrefutable,
will influence a future revision to the RFC.

   Bob Braden


PS I think you make a mistake by dismissing the IP multicasting mechanism.
A 4.3BSD implementation is available today for anyone who wants it.

Mills@UDEL.EDU (07/18/87)

Phil,

Read RFC-985 and its recent successor. The IP address space is not flat.
There are architected IP broadcast (and other) addresses, whether bogus
or not. The intent of the martian filter is to kill "local" broadcasts
before they escape the local net. This is a pragmatic consideration
based on some years of horror living without it.

Dave