mills@DCN6.ARPA (07/07/86)
Folks,

Don't start the following story unless you enjoy solving puzzles and have a few minutes to study and reflect on the issues. Be advised it is highly technical, not without personal bias and may leave some of you with elevated cranial energies. I especially would like our ANSI/ISO protocol designers to take this thing into their committees and spread mischief (Overheard in X3.S3: "Ya mean those Internet buzzards are doing THAT!?").

The cast of characters includes the NSFnet Backbone, which is now being installed among six Supercomputer sites, plus a bunch of Ethernets at those sites. Some of the sites are also interconnected via the USAN net, which uses a multiple-access Vitalink/TransLAN satellite channel, and some are connected to the ARPAnet via the Ethernets and other nets and gateways. The Backbone gateways, as well as some of the USAN and ARPAnet gateways involved, consist of LSI-11 "fuzzball" systems, which are reasonably good players of the ARP and ICMP game, as well as running their own routing algorithm.

The following schematic shows the configuration of these swamps expected by late August. All of the nodes and lines shown are already installed, except for some Backbone line-interface equipment. All of the non-Backbone nodes have been in operation for some time, while three of the Backbone nodes (SDSC, NCAR, Cornell) are presently interconnected and in operation. Only the fuzzballs are shown, since this discussion concerns primarily them; however, you should recognize there are lots and lots of other hosts on these nets and that some nets include many different media besides Ethernets.
                  [NSFnet Backbone schematic]

          ===== Ethernet    ----- serial line (speeds in kbps)

    Backbone trunks, all 56-kbps serial lines:

        NCAR --- CTC --- JVNC
        SDSC --- UIUC --- CMU --- UMD
        NCAR --- SDSC    NCAR --- UIUC    JVNC --- CMU

    Site Ethernets:

        NCAR 192.17.47    CTC  128.84.253    JVNC 128.121.50
        SDSC 192.12.207   UIUC 192.17.5      CMU  128.2.49
        UMD  128.8.1

    ARPAnet taps on the Ethernets at CTC, CMU and UMD.

                      [swamps schematic]

    FORDnet (Ethernet 128.5.0): fuzzballs FORD14 and FORD1, with a
    4.8-kbps serial line from FORD14 to UMICH3.

    UMICHnet (Ethernet 35.1.1): fuzzballs UMICH3 and UMICH1, with a
    120-kbps serial line from UMICH3 to USAN-GW.

    USAN (Ethernet 192.17.4, nine sites): reached via USAN-GW.

    UMD1: attaches to UMDnet via a 4.8-kbps line, with a MILnet tap
    nearby.
    DCnet (Ethernet 128.4.0): fuzzballs DCN8, DCN5 and DCN1, with
    9.6-kbps serial lines from DCN8 and DCN5 to the other swamps,
    including the DCN5 - UMD1 link, and an ARPAnet tap at DCN1.

The players are:

    NCAR    National Center for Atmospheric Research
    CTC     Cornell Theory Center (hardware integration and net operator)
    JVNC    John von Neumann Center
    SDSC    San Diego Supercomputer Center
    UIUC    University of Illinois/University of Chicago (project management)
    CMU     Carnegie-Mellon University
    UMD     University of Maryland (software integration)
    UMICH   University of Michigan (installation and test)
    DC      Fuzzball creche in Vienna, VA
    FORD    Ford Scientific Research Labs in Dearborn, MI

At the moment, all of the ARPAnet/MILnet gateways except DCN1 (aka DCN-GATEWAY) include only their own directly-connected client nets in EGP updates sent to the core gateways. DCN1 temporarily includes some of the other nets for debugging and test purposes. Therefore, traffic for the other nets must wander to DCN1 and swamp-fuzzball trail to the USAN and Backbone trunks (UMD is presently not operational). Eventually, Backbone connectivity may be provided to the ARPAnet by one or more gateways at CMU, CTC or UMD as well. Also at the moment, the Ethernets at some Backbone sites are connected via USAN to each other and via USAN-GW at UMICH to the Interworld.

As you can see, there will be a lush supply of routes available, with many sites enjoying connectivity via the Backbone, USAN and ARPAnet paths simultaneously. Believe it or not, traffic actually flies these airways, although sometimes landing in very strange airports. The fuzzballs shown operate an adaptive routing algorithm which selects primary routes based on minimum delay and can select alternate routes based on the IP type-of-service field and other factors. Casual observation of the DCnet Ethernet reveals there may be a lot of local-site problems yet to solve.
For instance, I spotted two hosts on Cornell local nets working each other via the DC Ethernet (!?!) the other day. Ya gotta see to believe. The ring of hosts and Ethernets at DCnet, UMICHnet and FORDnet has been a valuable prototyping facility, which is its primary service function.

Alternate routing in case of failure requires ICMP Redirect and ARP functions to work properly, both in the fuzzballs and in other network hosts, which are represented by a wide variety of VMS, Unix and related systems. A problem in this area is what prompted this message.

Recently, Hans-Werner Braun of UMICH and I endured scary experiences while teaching Etherbunnies, fuzzygators and other strange creatures to swim in these swamps. Especially enlightening was USAN, with its nine rambunctious Ethernets, all piled onto the same channel and babbling everything from DECnet and XNS mumbles to Unix rwho shrieks zipping about like lost cosmic particles. It took us two weeks at DefCon 5 to harden the fuzzball silos (which added a couple of dB of their own routing broadcasts) and make space safe for Backbone debug and test. Some of the lessons learned have already been broadcast for public enjoyment and may even have medicinal value. Additional observations are herewith submitted for your entertainment, education and as a basis for comments and suggestions.

What I hope we all get out of this exercise is not so much a blueprint for how to deal with the incredibly complicated brushfires we know are circling the horizon (to quote Dave Clark), but rather an experiment and proof-of-concept suggesting generic issues that need further study and resolution before we send out the fire brigade.

Multiplexed Ethernets

Most of the interesting lessons were learned on USAN, which radio amateurs will recognize as similar to the pileup when Pitcairn Island shows up on 20 meters.
The USAN design was never intended to handle the bedlam of nine simultaneous flocks of squawking rwhos, rips and other broadcast honkers, much less the blatant ICMP squawks from the poor creatures that don't understand them. The problem is, of course, that those of us who jimmied the Ethernet protocols while draining our own swamps never thought of making a single cable safe for multiple nets and subnets at the same time.

Hans-Werner and I found it useful to forget that the cable is normally protected from crosstalk between neighbor nets and pretend it is a willy-nilly-access packet-radio channel instead. You may not like this model and prefer a much more carefully regimented and regulated approach. We would like to understand what-if first and learn all we can before the wires are insulated and the lessons have to be relearned under meltdown conditions when the insulation wears off somewhere. Even if we agree multiplexed Ethernets are beyond the pale, this exercise may reveal how to harden the hosts and gateways against rogues (and jammers) that might occur from time to time. Therefore, we simply connected everything up and let the packets fly. Hans-Werner and I are still catching our breath, some wheezes of which can be found in the following.

We are putting together an almanac, which may appear as a footnote to RFC-985, to teach lessons something like the following: multiplexed Ethernets are extremely complex and delicate, but may represent a useful solution in exceptional cases. If you are silly enough to contemplate such folly, do it in the following way...

Type-of-Service Routing

Our research community has been stalking the type-of-service routing issue for a while now and may have fallen in the wrong pits.
It turned out to be easy for the fuzzballs to take a first cut at this simply by extending the address fields in the routing tables to include the IP type-of-service field (actually, the extension consists of a single octet, with each bit corresponding to one of the eight combinations of the Throughput, Delay and Reliability bits). The hard part begins when you try to make ARP, ICMP error messages and ICMP Redirects work in that model. The goal would be to maintain possibly separate routes for every combination of type-of-service specification. The main problem is that the packet formats are not rich enough to distinguish this information to the detail required. In addition, much of the existing host software must be changed in nontrivial ways (e.g. the ARP cache gets an extra octet, etc.). The fuzzball routing data structures already include such provisions, but the algorithms have not been evolved to exploit them yet.

Alternate Routes

It was our intention to explore the consequences of trying to provide alternate routes should something quit and where not all paths were monitored by the fuzzball routing algorithm. For instance, if a Backbone trunk were broken, it might be possible to route out the ARPAnet and back in again (please table administrative discussion - we just want to explore how the darn thing might work, not whether it would be allowed or not).

As an experiment, the fuzzballs were fiddled so that, if an internal link controlled by the fuzzball routing algorithm broke, traffic could be directed via a designated backup path. This was tested on the DCN5 - UMD1 link, with the designated backup path via the ARPAnet/MILnet gateways on each net, which is ordinarily controlled by EGP and the core system. The effect is that, when the DCN5 - UMD1 link is up, traffic flows via it, but, when the link is down, traffic flows the long way around via ARPAnet, MILnet and thousands of gateways.
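For concreteness, the single-octet type-of-service extension described under Type-of-Service Routing above can be sketched as follows. This is an illustrative sketch (names are mine, not the fuzzball code): each route carries a one-octet mask with a bit for each of the eight combinations of the Delay, Throughput and Reliability bits, and the lookup considers only routes whose mask covers the requested combination.

```python
def tos_index(delay, throughput, reliability):
    """Map the three IP type-of-service bits to a bit position 0..7."""
    return (delay << 2) | (throughput << 1) | reliability

class Route:
    def __init__(self, next_hop, metric, tos_mask=0xFF):
        self.next_hop = next_hop    # gateway to forward through
        self.metric = metric        # delay metric; lower is better
        self.tos_mask = tos_mask    # which of the 8 TOS combos this route serves

class RoutingTable:
    def __init__(self):
        self.routes = {}            # destination net -> list of Route

    def add(self, net, route):
        self.routes.setdefault(net, []).append(route)

    def lookup(self, net, delay=0, throughput=0, reliability=0):
        """Minimum-metric route whose TOS mask covers the request."""
        bit = 1 << tos_index(delay, throughput, reliability)
        eligible = [r for r in self.routes.get(net, []) if r.tos_mask & bit]
        return min(eligible, key=lambda r: r.metric) if eligible else None
```

With such a table, a low-delay serial route can be kept alongside a higher-delay route that serves all type-of-service values, and the redirect/ARP machinery is exactly where the complications described above set in.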
This was a cute experiment and worked just fine; however, its success depends on carefully engineered tables and delicate assumptions about the functionality and dynamics of both the fuzzball and EGP algorithms. More study and experimentation are needed here.

Subnets

All of the class-B nets shown in the schematic are subnetted, with the third octet interpreted as the subnet number. The class-A UMICHnet is subnetted as well. Good subnetting practice is to avoid the use of subnet-number fields containing all zero bits or all one bits, since this can lead to confusing interpretation of broadcast scope, for example. DCnet and FORDnet fall victim to this observation. I can't believe for a minute that vendor products without subnetting capability will have a long life in this era.

Broadcast and Don't-Know Addresses

The Internaut Handbook specifies that the local-subnet address with all ones in the host portion should be interpreted as "broadcast" and that the address with all zeros should be interpreted as "don't know." Under these interpretations, the former is legal only in destination-address fields, while the latter (used by a diskless workstation while RARPing for its address, for example) is legal only in source-address fields. Where subnets are in use, the scope of this interpretation extends only to the given subnet, if the subnet-number field is neither all zeros nor all ones, and to the entire network of subnets if it is either all zeros or all ones.

We observed numerous abuses of this model, including the use of zeros as the broadcast address (older bsd Unix) and various tangles with broadcasting in a subnet environment. In point of fact, it doesn't matter which convention is followed on a particular subnet, as long as all hosts and gateways on that subnet understand which is in use, as well as the subnet mask, and agree never to propagate either local-broadcast or local-don't-know datagrams beyond the gateways.
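The interpretation rules above can be put in executable form. A minimal sketch (helper names are illustrative): given a subnet mask, an address with all ones in the host part is "broadcast" (legal only as a destination) and one with all zeros is "don't know" (legal only as a source).

```python
def ip_to_int(addr):
    """Dotted-quad string to 32-bit integer."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def classify(addr, mask):
    """Classify addr under mask as 'broadcast', 'dont-know' or 'unicast'."""
    host_bits = ~ip_to_int(mask) & 0xFFFFFFFF
    host = ip_to_int(addr) & host_bits
    if host == host_bits:
        return "broadcast"   # all ones: destination-only
    if host == 0:
        return "dont-know"   # all zeros: source-only
    return "unicast"
```

The point of the paragraph above is that every host and gateway on the cable must agree on the mask fed to such a check, or the classifications diverge and broadcasts leak.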
The same consideration applies at the network-of-subnets level, of course. For reasons that should be evident from the following, I believe the use of zeros and ones in the subnet-number field and the above interpretation should be avoided in favor of explicit broadcast agents. One implication of this model is that broadcasts would never be propagated by a gateway.

I further believe that a very careful coupling should be maintained between the semantics of the Ethernet broadcast/multicast addresses and IP broadcast addresses; otherwise, the implementation is forced not only to carry all kinds of weird semantics up the protocol stack, but its error detection is seriously handicapped as well. A paranoid receiver may well check that, when an IP packet with an Ethernet broadcast/multicast destination address arrives, the destination-address field must contain the IP subnet-broadcast address, or else the packet is discarded. This would have the effect of disallowing random Ethernet broadcasts to designated IP destinations if the sender had a broken or unimplemented ARP, for example. It would also disallow cross-net routing broadcasts, such as the fuzzballs use to manage USAN routing. More thought is needed here.

Broadcasting Semantics

Nothing we found exhibited stranger behavior than the broadcast semantics of the various Ethernets. The most disruptive thing by far was the tendency of some receivers to "helpfully" relay broadcast packets with a destination IP address other than their own onward to the "intended" destination. An innocent rwho from a host on one subnet of a multiplexed cable ignites instant abuse of the gateways if the hosts on another subnet do this. In extreme cases the network can fall into a debilitating, oscillatory state (called meltdown), where the entire cable bandwidth is consumed by these packets. A general principle of nuclear engineering is that reactor meltdown is possible only if more energy gazouta the reactor than gazinta it.
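The gazinta/gazouta arithmetic can be shown in miniature (an illustrative sketch, not a simulation of any real net): if each of n "helpful" receivers re-emits every broadcast it hears, the packet count grows by a factor of n per round; under the at-most-one-copy rule it stays flat, and only an explicitly designated broadcast agent forwards at all.

```python
def packets_after(num_relays, rounds):
    """Total packets put on the cable by one original broadcast when
    every one of num_relays receivers re-emits each copy it hears."""
    packets, total = 1, 1
    for _ in range(rounds):
        packets *= num_relays   # every copy provokes num_relays more
        total += packets
    return total

def may_forward(arrived_as_broadcast, designated_agent):
    """The rule argued for in this section: only an explicitly
    designated broadcast agent may forward a packet that arrived
    as a broadcast; everyone else must drop it."""
    return designated_agent or not arrived_as_broadcast
```

With one relay per input the total grows linearly (no meltdown); with three relays it grows geometrically and the cable saturates within a few rounds.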
Ethernet meltdown cannot occur if it can be guaranteed that no more than one gazouta packet can ever be produced by a single gazinta packet at a node. Thus, a forwarded broadcast packet (hereby named a Chernobyl packet - you first heard it here) can produce meltdown only if more than one receiver is involved. Of course, if a Chernobyl packet is assigned an Ethernet broadcast address, the meltdown would occur within milliseconds. I believe no receiver (host or gateway) should ever forward broadcast packets onward to a subsequent destination, unless the receiver is an explicitly designated broadcast agent which explicitly understands and maintains the spanning trees and routing algorithms necessary to reach the intended destination without meltdowns.

What are the groundrules of a broadcast service? Those such as rwho, rip and fuzzball routing do not involve an explicit reply from each of possibly many receivers. Those such as ARP, RARP and ICMP Address Mask may produce a meteor shower of responses if multiple receivers can respond. In some cases where efficiency can be sacrificed for reliability, meteor showers may be acceptable, but in others this would be disruptive. A case might be made for datagram services like the above and maybe others, but this would be silly for connection-oriented services. Experiments with broadcast TCP service and unhardened fuzzballs led to hilarious scenarios ending in marginal meltdown, suggesting that multiple-destination semantics may have to be carried up the protocol stack anyway, at least for error detection.

ICMP Error Messages

The Internaut Handbook suggests ICMP error messages should be returned to the ultimate source if a datagram cannot be delivered to its intended destination. Some hosts interpret this literally and return ICMP error messages if the protocol or port fields do not match a service provided by the receiver.
We discovered this quickly in the case of the fuzzball routing algorithm, which broadcasts Hello messages on protocol 63 (private use) from time to time, each one creating a shower of ICMP error messages.

I believe no receiver (host or gateway) should ever return an ICMP error message unless it can determine with fair reliability that any other copies that might be wandering around the network will be routed by that receiver and that the same ICMP error message would be produced in each case. This is a general statement and applies to scenarios other than broadcast. One implication, for example, is that ICMP error messages must be considered non-deterministic, since one path can be temporarily stoppered up while the routing algorithm is thrashing and duplicates are successfully negotiating another path.

With respect to broadcast, ICMP error messages should never be sent in response to a received broadcast packet. Another way of looking at it is that no message should ever be sent if its semantics would involve the broadcast address as the source address (IP or Ethernet) in the packet.

Subnet Masks

No gateway implementation known to me (except the fuzzball as of today) supports the ICMP Address Mask messages described in RFC-950. If a gateway joins two subnets, it has to know which subnet the request is for, so it can determine the proper mask. If the requestor knows its own address, it can broadcast the request and the gateway(s) can determine which subnet from the source IP address of the request. If it does not know its own address, it must use RARP to find it (the alternative suggested in RFC-950 is simply too bizarre to contemplate in this environment). This is the only known case that requires broadcast of an ICMP message, which not only complicates the implementations, but suggests that maybe something is broken in the model.
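Returning for a moment to the rule stated under ICMP Error Messages above: a receiver might gate its error generation with a check like the following sketch (field names are my assumptions, not any particular implementation). No error is sent if the offending datagram arrived as a link-layer broadcast, or if its IP destination or source is a broadcast or don't-know address.

```python
def ip_to_int(addr):
    """Dotted-quad string to 32-bit integer."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def should_send_icmp_error(ether_dst_broadcast, ip_dst, ip_src, mask):
    host_bits = ~ip_to_int(mask) & 0xFFFFFFFF
    if ether_dst_broadcast:
        return False                  # datagram arrived as a broadcast
    for addr in (ip_dst, ip_src):
        host = ip_to_int(addr) & host_bits
        if host in (0, host_bits):
            return False              # broadcast or don't-know address
    return True
```

The broadcast Hello storms described above would have been suppressed by the first test alone; the source-address test keeps a reply from ever being aimed at a broadcast or don't-know address.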
The real problem is that the ARP/RARP semantics are simply not rich enough and should be expanded to include the mask in the first place. For instance, a RARP reply should include not only the specific host address, but also the address mask associated with its subnet. Also, if promiscuous ARP is used to bypass a specific gateway address, an ARP reply should include the address mask associated with the subnet the address belongs to (if known). The requisite semantics and packet formats might not be hard to provide, since ARP and RARP are quite generic. While adding the subnet mask, the type-of-service mask should also be added. The ICMP Address Mask messages should be junked.

Martian Filters

The idea of Martian Filters originated as a weapon to combat datagrams carrying bogus destination addresses (like 127.0.0.0), which can be emitted by broken bsd Unix systems. These datagrams sometimes escape their local net and are found wandering about the Internet, creating a significant hazard to swamp navigation. An unbelievable brew of this stuff was found sloshing over USAN, which prompted my earlier message on this subject.

Martian Filters search for "reserved," "broadcast" and "don't-know" addresses identified in the Assigned Numbers list and discard datagrams (without ICMP error notification) to those destinations, unless addressed to the host/gateway directly. Use of this filter prevents the recipient from forwarding a broadcast datagram back onto the cable or sending disruptive ICMP error messages in the case of unknown protocols or ports appearing in broadcast datagrams. We found these filters to be absolutely necessary to avoid chaos (pun) on USAN.

When subnets are in use, a receiver may not know that a particular IP address in fact represents a broadcast, except that it presumably came in on a datagram with a broadcast address in the Ethernet destination-address field.
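A minimal Martian Filter of the kind described above might look like the following sketch. The address list here is illustrative, not the Assigned Numbers list: datagrams to these destinations are silently discarded - no ICMP error notification - unless addressed to this host directly.

```python
MARTIANS = {"0.0.0.0", "255.255.255.255"}   # don't-know, limited broadcast

def is_martian(ip_dst, my_addr):
    if ip_dst == my_addr:
        return False                  # addressed to us directly: deliver
    if ip_dst in MARTIANS:
        return True
    if ip_dst.startswith("127."):     # loopback leaking from broken hosts
        return True
    return False

def input_filter(ip_dst, my_addr):
    """Returns 'deliver', or 'drop' with no error notification."""
    return "drop" if is_martian(ip_dst, my_addr) else "deliver"
```

The silent drop is the important part: answering a Martian with an ICMP error is exactly the kind of gazouta that feeds a meltdown.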
The Martian Filter we used is subnet-independent and filters out only the well-known network-level broadcast and don't-know addresses. However, the fuzzballs, at least, know what subnet they are on and grab local-net broadcast and don't-know addresses before they can be sent onward to create mischief. If it can be agreed that the network-of-subnets semantics mentioned above (i.e. disallowed) is correct, then the broadcast and don't-know filters can be implemented more efficiently and effectively. More study and consensus is needed in this area.

Asymmetric Paths and Redirects

From the above diagram you can see that, when airport DCN8 closes, flights to FORD1 can continue via DCN5, UMICH1/3 and FORD14 as determined by the routing algorithm. ARPs for FORDnet will then be answered by DCN5 and all other hosts will return ICMP Redirect messages as expected. Now, it turns out that some silly host on UMDnet thinks the slickest airway to FORD1 is out MILnet, back in at DCN1 and radar vectors to FORD1, but the DCnet controllers know the smoothest air is via DCN5 and UMD1. Well, all that asymmetry works, too, until such time as DCN8 opens up again.

When the routing algorithm finds that DCN8 is open, everything shuffles as expected in DCN5 and DCN8; however, the other DCnet hosts sharing the Ethernet may not realize this, since their minimum-delay path is still via the Ethernet. Ordinarily, these hosts will update their routing tables correctly when they send traffic to FORD1, since DCN5 will redirect that traffic to DCN8. The problem lies in the question: which local-net host (Ethernet address) does DCN5 send the redirect to? If it doesn't know better, it simply looks in its routing tables for the sender (the UMD host) and determines the route via the DCN5 - UMD1 link, but it would be nonsense to send the redirect there. However, the incoming traffic actually came via DCN1, so the redirect should really be sent to it instead.
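A receiver that wants to get this right has to capture the link-layer source as each frame arrives. A minimal sketch (illustrative names, not the fuzzball code): remember the Ethernet source address per arriving IP source and address the outbound Redirect to it, rather than doing a routing-table lookup on the (possibly far-away, asymmetric) IP source.

```python
class RedirectingGateway:
    def __init__(self):
        self.last_hop = {}            # IP source -> Ethernet source seen

    def receive(self, ether_src, ip_src, ip_dst):
        # Remember which Ethernet neighbor actually handed us the packet.
        self.last_hop[ip_src] = ether_src

    def make_redirect(self, ip_src, better_gateway):
        """Build a Redirect for traffic from ip_src; it goes back to the
        neighbor the traffic came from, even on asymmetric paths."""
        return {"ether_dst": self.last_hop[ip_src],
                "redirect_to": better_gateway}
```

The cost of this, as the next paragraph explains, is that an Ethernet address has now been carried further up the stack than clean layering would like.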
Good implementation technique tries to separate protocol layers cleanly, which suggests Ethernet addresses be propagated no further up the stack than absolutely necessary. Since an ICMP Redirect does, after all, have an IP header, the temptation is to treat it like other ICMP messages as a wart on the side of the network (IP) layer. But other ICMP messages don't have this insidious coupling to the local-network (Ethernet) layer, so no provisions were made in my original design to save the Ethernet addresses at all.

Well, I shambled back to the hangar and tinkered the fuzzware, so now all controllers are on the same frequency. My solution was to remember the Ethernet source address as each Ethernet packet arrives and stuff it in the destination address of the outbound ICMP Redirect message. Some of you may remember my previous messages which argued against propagating Ethernet addresses up the stack and may suggest I got what I deserved. I am still opposed in principle to spreading local-net layer semantics (like broadcast addresses) outside that layer and believe such semantics should be re-created at the network level (e.g. use the IP broadcast address) as necessary. Unfortunately, for reasons mentioned a paragraph or two back, I may have to eat them words.

Dave

-------