mills@DCN6.ARPA (07/07/86)
Folks,

Don't start the following story unless you enjoy solving puzzles and have a few minutes to study and reflect on the issues. Be advised it is highly technical, not without personal bias and may leave some of you with elevated cranial energies. I especially would like our ANSI/ISO protocol designers to take this thing into their committees and spread mischief (Overheard in X3.S3: "Ya mean those Internet buzzards are doing THAT!?").

The cast of characters includes the NSFnet Backbone, which is now being installed among six Supercomputer sites, plus a bunch of Ethernets at those sites. Some of the sites are also interconnected via the USAN net, which uses a multiple-access Vitalink/TransLAN satellite channel, and some are connected to the ARPAnet via the Ethernets and other nets and gateways. The Backbone gateways, as well as some of the USAN and ARPAnet gateways involved, consist of LSI-11 "fuzzball" systems, which are reasonably good players of the ARP and ICMP game, as well as running their own routing algorithm.

The following schematic shows the configuration of these swamps expected by late August. All of the nodes and lines shown are already installed, except for some Backbone line-interface equipment. All of the non-Backbone nodes have been in operation for some time, while three of the Backbone nodes (SDSC, NCAR, Cornell) are presently interconnected and in operation. Only the fuzzballs are shown, since this discussion concerns primarily them; however, you should recognize there are lots and lots of other hosts on these nets and that some nets include many different media besides Ethernets.
                  [NSFnet Backbone schematic]

          ===== Ethernet    ----- serial line (speeds in kbps)

    Backbone trunks, all 56-kbps serial lines:

        NCAR --- CTC --- JVNC
        SDSC --- UIUC --- CMU --- UMD
        NCAR --- SDSC    NCAR --- UIUC    JVNC --- CMU

    Site Ethernets:

        NCAR 192.17.47    CTC  128.84.253    JVNC 128.121.50
        SDSC 192.12.207   UIUC 192.17.5      CMU  128.2.49
        UMD  128.8.1

    ARPAnet taps on the Ethernets at CTC, CMU and UMD.

                      [swamps schematic]

    FORDnet (Ethernet 128.5.0): fuzzballs FORD14 and FORD1, with a
    4.8-kbps serial line from FORD14 to UMICH3.

    UMICHnet (Ethernet 35.1.1): fuzzballs UMICH3 and UMICH1, with a
    120-kbps serial line from UMICH3 to USAN-GW.

    USAN (Ethernet 192.17.4, nine sites): reached via USAN-GW.

    UMD1: attaches to UMDnet via a 4.8-kbps line, with a MILnet tap
    nearby.
    DCnet (Ethernet 128.4.0): fuzzballs DCN8, DCN5 and DCN1, with
    9.6-kbps serial lines from DCN8 and DCN5 to the other swamps,
    including the DCN5 - UMD1 link, and an ARPAnet tap at DCN1.

The players are:

    NCAR    National Center for Atmospheric Research
    CTC     Cornell Theory Center (hardware integration and net operator)
    JVNC    John von Neumann Center
    SDSC    San Diego Supercomputer Center
    UIUC    University of Illinois/University of Chicago (project management)
    CMU     Carnegie-Mellon University
    UMD     University of Maryland (software integration)
    UMICH   University of Michigan (installation and test)
    DC      Fuzzball creche in Vienna, VA
    FORD    Ford Scientific Research Labs in Dearborn, MI

At the moment, all of the ARPAnet/MILnet gateways except DCN1 (aka DCN-GATEWAY) include only their own directly-connected client nets in EGP updates sent to the core gateways. DCN1 temporarily includes some of the other nets for debugging and test purposes. Therefore, traffic for the other nets must wander to DCN1 and swamp-fuzzball trail to the USAN and Backbone trunks (UMD is presently not operational). Eventually, Backbone connectivity may be provided to the ARPAnet by one or more gateways at CMU, CTC or UMD as well. Also at the moment, the Ethernets at some Backbone sites are connected via USAN to each other and via USAN-GW at UMICH to the Interworld.

As you can see, there will be a lush supply of routes available, with many sites enjoying connectivity via the Backbone, USAN and ARPAnet paths simultaneously. Believe it or not, traffic actually flies these airways, although sometimes landing in very strange airports. The fuzzballs shown operate an adaptive routing algorithm which selects primary routes based on minimum delay and can select alternate routes based on the IP type-of-service field and other factors. Casual observation of the DCnet Ethernet reveals there may be a lot of local-site problems yet to solve.
For instance, I spotted two hosts on Cornell local nets working each other via the DC Ethernet (!?!) the other day. Ya gotta see to believe. The ring of hosts and Ethernets at DCnet, UMICHnet and FORDnet has been a valuable prototyping facility, which is its primary service function.

Alternate routing in case of failure requires ICMP Redirect and ARP functions to work properly, both in the fuzzballs and in other network hosts, which are represented by a wide variety of VMS, Unix and related systems. A problem in this area is what prompted this message.

Recently, Hans-Werner Braun of UMICH and I endured scary experiences while teaching Etherbunnies, fuzzygators and other strange creatures to swim in these swamps. Especially enlightening was USAN, with its nine rambunctious Ethernets, all piled onto the same channel and babbling everything from DECnet and XNS mumbles to Unix rwho shrieks zipping about like lost cosmic particles. It took us two weeks at DefCon 5 to harden the fuzzball silos (which added a couple of dB of their own routing broadcasts) and make space safe for Backbone debug and test. Some of the lessons learned have already been broadcast for public enjoyment and may even have medicinal value. Additional observations are herewith submitted for your entertainment, education and as a basis for comments and suggestions.

What I hope we all get out of this exercise is not so much a blueprint for how to deal with the incredibly complicated brushfires we know are circling the horizon (to quote Dave Clark), but rather an experiment and proof-of-concept suggesting generic issues that need further study and resolution before we send out the fire brigade.

Multiplexed Ethernets

Most of the interesting lessons were learned on USAN, which radio amateurs will recognize as similar to the pileup when Pitcairn Island shows up on 20 meters.
The USAN design was never intended to handle the bedlam of nine simultaneous flocks of squawking rwhos, rips and other broadcast honkers, much less the blatant ICMP squawks from the poor creatures that don't understand them. The problem is, of course, that those of us who jimmied the Ethernet protocols while draining our own swamps never thought of making a single cable safe for multiple nets and subnets at the same time.

Hans-Werner and I found it useful to forget that the cable is normally protected from crosstalk between neighbor nets and pretend it is a willy-nilly-access packet-radio channel instead. You may not like this model and prefer a much more carefully regimented and regulated approach. We would like to understand what-if first and learn all we can before the wires are insulated and the lessons have to be relearned under meltdown conditions when the insulation wears off somewhere. Even if we agree multiplexed Ethernets are beyond the pale, this exercise may reveal how to harden the hosts and gateways against rogues (and jammers) that might occur from time to time. Therefore, we simply connected everything up and let the packets fly. Hans-Werner and I are still catching our breath, some wheezes of which can be found in the following.

We are putting together an almanac, which may appear as a footnote to RFC-985, to teach lessons something like the following: multiplexed Ethernets are extremely complex and delicate, but may represent a useful solution in exceptional cases. If you are silly enough to contemplate such folly, do it in the following way...

Type-of-Service Routing

Our research community has been stalking the type-of-service routing issue for a while now and may have fallen in the wrong pits.
It turned out to be easy for the fuzzballs to take a first cut at this simply by extending the address fields in the routing tables to include the IP type-of-service field (actually, the extension consists of a single octet, with each bit corresponding to one of the eight combinations of the Throughput, Delay and Reliability bits). The hard part begins when you try to make ARP, ICMP error messages and ICMP Redirects work in that model. The goal would be to maintain possibly separate routes for every combination of type-of-service specification. The main problem is that the packet formats are not rich enough to distinguish this information to the detail required. In addition, much of the existing host software must be changed in nontrivial ways (e.g. the ARP cache gets an extra octet, etc.). The fuzzball routing data structures already include such provisions, but the algorithms have not been evolved to exploit them yet.

Alternate Routes

It was our intention to explore the consequences of trying to provide alternate routes should something quit and where not all paths were monitored by the fuzzball routing algorithm. For instance, if a Backbone trunk were broken, it might be possible to route out the ARPAnet and back in again (please table administrative discussion - we just want to explore how the darn thing might work, not whether it would be allowed or not).

As an experiment, the fuzzballs were fiddled so that, if an internal link controlled by the fuzzball routing algorithm broke, traffic could be directed via a designated backup path. This was tested on the DCN5 - UMD1 link, with the designated backup path via the ARPAnet/MILnet gateways on each net, which is ordinarily controlled by EGP and the core system. The effect is that, when the DCN5 - UMD1 link is up, traffic flows via it, but, when the link is down, traffic flows the long way around via ARPAnet, MILnet and thousands of gateways.
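For concreteness, the single-octet type-of-service extension described under Type-of-Service Routing above can be sketched as follows. This is an illustrative sketch (names are mine, not the fuzzball code): each route carries a one-octet mask with a bit for each of the eight combinations of the Delay, Throughput and Reliability bits, and the lookup considers only routes whose mask covers the requested combination.

```python
def tos_index(delay, throughput, reliability):
    """Map the three IP type-of-service bits to a bit position 0..7."""
    return (delay << 2) | (throughput << 1) | reliability

class Route:
    def __init__(self, next_hop, metric, tos_mask=0xFF):
        self.next_hop = next_hop    # gateway to forward through
        self.metric = metric        # delay metric; lower is better
        self.tos_mask = tos_mask    # which of the 8 TOS combos this route serves

class RoutingTable:
    def __init__(self):
        self.routes = {}            # destination net -> list of Route

    def add(self, net, route):
        self.routes.setdefault(net, []).append(route)

    def lookup(self, net, delay=0, throughput=0, reliability=0):
        """Minimum-metric route whose TOS mask covers the request."""
        bit = 1 << tos_index(delay, throughput, reliability)
        eligible = [r for r in self.routes.get(net, []) if r.tos_mask & bit]
        return min(eligible, key=lambda r: r.metric) if eligible else None
```

With such a table, a low-delay serial route can be kept alongside a higher-delay route that serves all type-of-service values, and the redirect/ARP machinery is exactly where the complications described above set in.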
This was a cute experiment and worked just fine; however, its success depends on carefully engineered tables and delicate assumptions about the functionality and dynamics of both the fuzzball and EGP algorithms. More study and experimentation are needed here.

Subnets

All of the class-B nets shown in the schematic are subnetted, with the third octet interpreted as the subnet number. The class-A UMICHnet is subnetted as well. Good subnetting practice is to avoid the use of subnet-number fields containing all zero bits or all one bits, since this can lead to confusing interpretation of broadcast scope, for example. DCnet and FORDnet fall victim to this observation. I can't believe for a minute that vendor products without subnetting capability will have a long life in this era.

Broadcast and Don't-Know Addresses

The Internaut Handbook specifies that the local-subnet address with all ones in the host portion should be interpreted as "broadcast" and that the address with all zeros should be interpreted as "don't know." Under these interpretations, the former is legal only in destination-address fields, while the latter (used by a diskless workstation while RARPing for its address, for example) is legal only in source-address fields. Where subnets are in use, the scope of this interpretation extends only to the given subnet, if the subnet-number field is neither all zeros nor all ones, and to the entire network of subnets if it is either all zeros or all ones.

We observed numerous abuses of this model, including the use of zeros as the broadcast address (older bsd Unix) and various tangles with broadcasting in a subnet environment. In point of fact, it doesn't matter which convention is followed on a particular subnet, as long as all hosts and gateways on that subnet understand which is in use, as well as the subnet mask, and agree never to propagate either local-broadcast or local-don't-know datagrams beyond the gateways.
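The interpretation rules above can be put in executable form. A minimal sketch (helper names are illustrative): given a subnet mask, an address with all ones in the host part is "broadcast" (legal only as a destination) and one with all zeros is "don't know" (legal only as a source).

```python
def ip_to_int(addr):
    """Dotted-quad string to 32-bit integer."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def classify(addr, mask):
    """Classify addr under mask as 'broadcast', 'dont-know' or 'unicast'."""
    host_bits = ~ip_to_int(mask) & 0xFFFFFFFF
    host = ip_to_int(addr) & host_bits
    if host == host_bits:
        return "broadcast"   # all ones: destination-only
    if host == 0:
        return "dont-know"   # all zeros: source-only
    return "unicast"
```

The point of the paragraph above is that every host and gateway on the cable must agree on the mask fed to such a check, or the classifications diverge and broadcasts leak.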
The same consideration applies at the network-of-subnets level, of course. For reasons that should be evident from the following, I believe the use of zeros and ones in the subnet-number field and the above interpretation should be avoided in favor of explicit broadcast agents. One implication of this model is that broadcasts would never be propagated by a gateway.

I further believe that a very careful coupling should be maintained between the semantics of the Ethernet broadcast/multicast addresses and IP broadcast addresses; otherwise, the implementation is forced not only to carry all kinds of weird semantics up the protocol stack, but its error detection is seriously handicapped as well. A paranoid receiver may well check that, when an IP packet with an Ethernet broadcast/multicast destination address arrives, the destination-address field must contain the IP subnet-broadcast address, or else the packet is discarded. This would have the effect of disallowing random Ethernet broadcasts to designated IP destinations if the sender had a broken or unimplemented ARP, for example. It would also disallow cross-net routing broadcasts, such as the fuzzballs use to manage USAN routing. More thought is needed here.

Broadcasting Semantics

Nothing we found exhibited stranger behavior than the broadcast semantics of the various Ethernets. The most disruptive thing by far was the tendency of some receivers to "helpfully" relay broadcast packets with a destination IP address other than their own onward to the "intended" destination. An innocent rwho from a host on one subnet of a multiplexed cable ignites instant abuse of the gateways if the hosts on another subnet do this. In extreme cases the network can fall into a debilitating, oscillatory state (called meltdown), where the entire cable bandwidth is consumed by these packets. A general principle of nuclear engineering is that reactor meltdown is possible only if more energy gazouta the reactor than gazinta it.
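The gazinta/gazouta arithmetic can be shown in miniature (an illustrative sketch, not a simulation of any real net): if each of n "helpful" receivers re-emits every broadcast it hears, the packet count grows by a factor of n per round; under the at-most-one-copy rule it stays flat, and only an explicitly designated broadcast agent forwards at all.

```python
def packets_after(num_relays, rounds):
    """Total packets put on the cable by one original broadcast when
    every one of num_relays receivers re-emits each copy it hears."""
    packets, total = 1, 1
    for _ in range(rounds):
        packets *= num_relays   # every copy provokes num_relays more
        total += packets
    return total

def may_forward(arrived_as_broadcast, designated_agent):
    """The rule argued for in this section: only an explicitly
    designated broadcast agent may forward a packet that arrived
    as a broadcast; everyone else must drop it."""
    return designated_agent or not arrived_as_broadcast
```

With one relay per input the total grows linearly (no meltdown); with three relays it grows geometrically and the cable saturates within a few rounds.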
Ethernet meltdown cannot occur if it can be guaranteed that no more than one gazouta packet can ever be produced by a single gazinta packet at a node. Thus, a forwarded broadcast packet (hereby named a Chernobyl packet - you first heard it here) can produce meltdown only if more than one receiver is involved. Of course, if a Chernobyl packet is assigned an Ethernet broadcast address, the meltdown would occur within milliseconds. I believe no receiver (host or gateway) should ever forward broadcast packets onward to a subsequent destination, unless the receiver is an explicitly designated broadcast agent which explicitly understands and maintains the spanning trees and routing algorithms necessary to reach the intended destination without meltdowns.

What are the groundrules of a broadcast service? Those such as rwho, rip and fuzzball routing do not involve an explicit reply from each of possibly many receivers. Those such as ARP, RARP and ICMP Address Mask may produce a meteor shower of responses if multiple receivers can respond. In some cases where efficiency can be sacrificed for reliability, meteor showers may be acceptable, but in others this would be disruptive. A case might be made for datagram services like the above and maybe others, but this would be silly for connection-oriented services. Experiments with broadcast TCP service and unhardened fuzzballs led to hilarious scenarios ending in marginal meltdown, suggesting that multiple-destination semantics may have to be carried up the protocol stack anyway, at least for error detection.

ICMP Error Messages

The Internaut Handbook suggests ICMP error messages should be returned to the ultimate source if a datagram cannot be delivered to its intended destination. Some hosts interpret this literally and return ICMP error messages if the protocol or port fields do not match a service provided by the receiver.
We discovered this quickly in the case of the fuzzball routing algorithm, which broadcasts Hello messages on protocol 63 (private use) from time to time, each one creating a shower of ICMP error messages.

I believe no receiver (host or gateway) should ever return an ICMP error message unless it can determine with fair reliability that any other copies that might be wandering around the network will be routed by that receiver and that the same ICMP error message would be produced in each case. This is a general statement and applies to scenarios other than broadcast. One implication, for example, is that ICMP error messages must be considered non-deterministic, since one path can be temporarily stoppered up while the routing algorithm is thrashing and duplicates are successfully negotiating another path.

With respect to broadcast, ICMP error messages should never be sent in response to a received broadcast packet. Another way of looking at it is that no message should ever be sent if its semantics would involve the broadcast address as the source address (IP or Ethernet) in the packet.

Subnet Masks

No gateway implementation known to me (except the fuzzball as of today) supports the ICMP Address Mask messages described in RFC-950. If a gateway joins two subnets, it has to know which subnet the request is for, so it can determine the proper mask. If the requestor knows its own address, it can broadcast the request and the gateway(s) can determine which subnet from the source IP address of the request. If it does not know its own address, it must use RARP to find it (the alternative suggested in RFC-950 is simply too bizarre to contemplate in this environment). This is the only known case that requires broadcast of an ICMP message, which not only complicates the implementations, but suggests that maybe something is broken in the model.
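Returning for a moment to the rule stated under ICMP Error Messages above: a receiver might gate its error generation with a check like the following sketch (field names are my assumptions, not any particular implementation). No error is sent if the offending datagram arrived as a link-layer broadcast, or if its IP destination or source is a broadcast or don't-know address.

```python
def ip_to_int(addr):
    """Dotted-quad string to 32-bit integer."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def should_send_icmp_error(ether_dst_broadcast, ip_dst, ip_src, mask):
    host_bits = ~ip_to_int(mask) & 0xFFFFFFFF
    if ether_dst_broadcast:
        return False                  # datagram arrived as a broadcast
    for addr in (ip_dst, ip_src):
        host = ip_to_int(addr) & host_bits
        if host in (0, host_bits):
            return False              # broadcast or don't-know address
    return True
```

The broadcast Hello storms described above would have been suppressed by the first test alone; the source-address test keeps a reply from ever being aimed at a broadcast or don't-know address.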
The real problem is that the ARP/RARP semantics are simply not rich enough and should be expanded to include the mask in the first place. For instance, a RARP reply should include not only the specific host address, but also the address mask associated with its subnet. Also, if promiscuous ARP is used to bypass a specific gateway address, an ARP reply should include the address mask associated with the subnet the address belongs to (if known). The requisite semantics and packet formats might not be hard to provide, since ARP and RARP are quite generic. While adding the subnet mask, the type-of-service mask should also be added. The ICMP Address Mask messages should be junked.

Martian Filters

The idea of Martian Filters originated as a weapon to combat datagrams carrying bogus destination addresses (like 127.0.0.0), which can be emitted by broken bsd Unix systems. These datagrams sometimes escape their local net and are found wandering about the Internet, creating a significant hazard to swamp navigation. An unbelievable brew of this stuff was found sloshing over USAN, which prompted my earlier message on this subject.

Martian Filters search for "reserved," "broadcast" and "don't-know" addresses identified in the Assigned Numbers list and discard datagrams (without ICMP error notification) to those destinations, unless addressed to the host/gateway directly. Use of this filter prevents the recipient from forwarding a broadcast datagram back onto the cable or sending disruptive ICMP error messages in the case of unknown protocols or ports appearing in broadcast datagrams. We found these filters to be absolutely necessary to avoid chaos (pun) on USAN.

When subnets are in use, a receiver may not know that a particular IP address in fact represents a broadcast, except that it presumably came in on a datagram with a broadcast address in the Ethernet destination-address field.
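A minimal Martian Filter of the kind described above might look like the following sketch. The address list here is illustrative, not the Assigned Numbers list: datagrams to these destinations are silently discarded - no ICMP error notification - unless addressed to this host directly.

```python
MARTIANS = {"0.0.0.0", "255.255.255.255"}   # don't-know, limited broadcast

def is_martian(ip_dst, my_addr):
    if ip_dst == my_addr:
        return False                  # addressed to us directly: deliver
    if ip_dst in MARTIANS:
        return True
    if ip_dst.startswith("127."):     # loopback leaking from broken hosts
        return True
    return False

def input_filter(ip_dst, my_addr):
    """Returns 'deliver', or 'drop' with no error notification."""
    return "drop" if is_martian(ip_dst, my_addr) else "deliver"
```

The silent drop is the important part: answering a Martian with an ICMP error is exactly the kind of gazouta that feeds a meltdown.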
The Martian Filter we used is subnet-independent and filters out only the well-known network-level broadcast and don't-know addresses. However, the fuzzballs, at least, know what subnet they are on and grab local-net broadcast and don't-know addresses before they can be sent onward to create mischief. If it can be agreed that the network-of-subnets semantics mentioned above (i.e. disallowed) is correct, then the broadcast and don't-know filters can be implemented more efficiently and effectively. More study and consensus is needed in this area.

Asymmetric Paths and Redirects

From the above diagram you can see that, when airport DCN8 closes, flights to FORD1 can continue via DCN5, UMICH1/3 and FORD14 as determined by the routing algorithm. ARPs for FORDnet will then be answered by DCN5 and all other hosts will return ICMP Redirect messages as expected. Now, it turns out that some silly host on UMDnet thinks the slickest airway to FORD1 is out MILnet, back in at DCN1 and radar vectors to FORD1, but the DCnet controllers know the smoothest air is via DCN5 and UMD1. Well, all that asymmetry works, too, until such time as DCN8 opens up again.

When the routing algorithm finds that DCN8 is open, everything shuffles as expected in DCN5 and DCN8; however, the other DCnet hosts sharing the Ethernet may not realize this, since their minimum-delay path is still via the Ethernet. Ordinarily, these hosts will update their routing tables correctly when they send traffic to FORD1, since DCN5 will redirect that traffic to DCN8. The problem lies in the question: which local-net host (Ethernet address) does DCN5 send the redirect to? If it doesn't know better, it simply looks in its routing tables for the sender (the UMD host) and determines the route via the DCN5 - UMD1 link, but it would be nonsense to send the redirect there. However, the incoming traffic actually came via DCN1, so the redirect should really be sent to it instead.
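A receiver that wants to get this right has to capture the link-layer source as each frame arrives. A minimal sketch (illustrative names, not the fuzzball code): remember the Ethernet source address per arriving IP source and address the outbound Redirect to it, rather than doing a routing-table lookup on the (possibly far-away, asymmetric) IP source.

```python
class RedirectingGateway:
    def __init__(self):
        self.last_hop = {}            # IP source -> Ethernet source seen

    def receive(self, ether_src, ip_src, ip_dst):
        # Remember which Ethernet neighbor actually handed us the packet.
        self.last_hop[ip_src] = ether_src

    def make_redirect(self, ip_src, better_gateway):
        """Build a Redirect for traffic from ip_src; it goes back to the
        neighbor the traffic came from, even on asymmetric paths."""
        return {"ether_dst": self.last_hop[ip_src],
                "redirect_to": better_gateway}
```

The cost of this, as the next paragraph explains, is that an Ethernet address has now been carried further up the stack than clean layering would like.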
Good implementation technique tries to separate protocol layers cleanly, which suggests Ethernet addresses be propagated no further up the stack than absolutely necessary. Since an ICMP Redirect does, after all, have an IP header, the temptation is to treat it like other ICMP messages as a wart on the side of the network (IP) layer. But other ICMP messages don't have this insidious coupling to the local-network (Ethernet) layer, so no provisions were made in my original design to save the Ethernet addresses at all.

Well, I shambled back to the hangar and tinkered the fuzzware, so now all controllers are on the same frequency. My solution was to remember the Ethernet source address as each Ethernet packet arrives and stuff it in the destination address of the outbound ICMP Redirect message. Some of you may remember my previous messages which argued against propagating Ethernet addresses up the stack and may suggest I got what I deserved. I am still opposed in principle to spreading local-net layer semantics (like broadcast addresses) outside that layer and believe such semantics should be re-created at the network level (e.g. use the IP broadcast address) as necessary. Unfortunately, for reasons mentioned a paragraph or two back, I may have to eat them words.

Dave

-------