[comp.os.xinu] Xinu Internet Gateway: a report

dls@mace.cc.purdue.edu (David L Stevens) (01/22/89)

	The following is a description of some Xinu work done as a semester
project in Doug Comer's graduate level course on Internetworking here at
Purdue. We divided the class into groups of 1 or 2 people and built Internet
Gateways on top of the Sun version of Xinu V7. This describes the experiences,
successes and failures of one of the nine groups. Much of the work is the
result of ideas or suggestions from Rajendra Yavatkar, Jim Griffoen and Doug
Comer, and of discussions with other members of the class.
	Because of the size of this project and the limited bandwidth offered
by newsgroups, I've left out many details in this description so that I might
complete it with a single posting. Feel free to contact me via e-mail for
particulars; I'll respond to all queries as time permits.

Othernets on Ether
	In order to implement a large number of gateways without actually
requiring several dedicated machines and multiple physical networks, we took
the approach of simulating a fictional network technology, called "Othernet,"
on top of physical Ethernet hardware. To keep Ethernet addresses distinct, we
used multicast addresses on Ethernet hardware that has programmable address
recognition (Sun's standard LANCE Ethernet chip). Each machine would recognize
its "real" Ethernet address and some number of Othernet addresses configured
at boot/compile time.
	Comer gave each group an id (a number) and a simple scheme for
computing distinct Othernet hardware and IP addresses based on the group id
(and the machine's Ethernet address, for non-broadcast addresses) so that
multiple simulations could run on the same (physical) network without
interfering.

Othernet Simulation
	The Othernets are indistinguishable from real devices in the
configuration file. Each has othinit(), othread(), othwrite(), etc. The
interrupt function is ethinter() and one function of othinit() is to associate
each simulated Othernet with an Ethernet. It keeps the simulated machine and
broadcast addresses in the device control block for the Ethernet. When a
packet arrives, ethinter() demultiplexes it to the appropriate Xinu device
by comparing the address in the packet buffer with the address list for the
simulated devices. Promiscuous mode (not part of stock v7) is still supported
because packets that don't match the simulated addresses go to the ETHER
device by default, without an address comparison.
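
	The address comparison described above might look roughly like the
following. This is only an illustrative sketch, not the project source: the
names struct othaddr, oth_demux() and DEV_ETHER are invented here, and the
real ethinter() does considerably more than this.

```c
#include <string.h>

#define EPADLEN   6	/* bytes in an Ethernet address			*/
#define DEV_ETHER (-1)	/* hypothetical code meaning "the ETHER device"	*/

/* hypothetical record of one simulated Othernet's addresses, as kept
 * in the device control block of the underlying Ethernet */
struct othaddr {
	unsigned char oa_paddr[EPADLEN];	/* simulated machine address	*/
	unsigned char oa_baddr[EPADLEN];	/* simulated broadcast address	*/
};

/* Return the index of the simulated device whose machine or broadcast
 * address equals the packet's destination address, or DEV_ETHER if no
 * simulated address matches. */
int oth_demux(const unsigned char *dst, const struct othaddr *tab, int n)
{
	int i;

	for (i = 0; i < n; ++i)
		if (memcmp(dst, tab[i].oa_paddr, EPADLEN) == 0 ||
		    memcmp(dst, tab[i].oa_baddr, EPADLEN) == 0)
			return i;
	return DEV_ETHER;	/* default case: no comparison against the
				 * real Ethernet address is made */
}
```

Because the fall-through case does no comparison at all, anything the
hardware accepts in promiscuous mode still reaches the ETHER device.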
	The Othernet device control block, among other things, includes the
associated Ethernet device, so it can honor the Ether device's write semaphore.

A Generic Network Interface layer
	With multiple interface technologies (real or simulated) comes
detailed, device-type-dependent information that is not directly relevant to
the basic network I/O that IP needs. To hide this information from IP, we added
a new layer, the "Network Interface" layer, which includes I/O queues, interface
statistics, the maximum transfer unit and IP address information for the
purpose of routing. It also includes hardware addresses, stored with their
sizes so that it can accommodate more than just Ethernets.
	Each network interface has its own netin() and netout() that do
the blocking on reads and writes and queue/dequeue packets bound for IP.
They handle other protocols (eg, ARP and RARP) themselves, without an
intermediate queue. The IP process acts as a central switch with queues
to/from the netin()
and netout() processes, so that it may process packets from multiple interfaces
without blocking on any one. The functions putp() and getp() handle these
queues, among other things.
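
	To make the queueing concrete, here is a sketch of how a central
process could take packets from several per-interface queues in turn without
waiting on any single one. It is not the project's getp(): the names
struct ifq, enq_pkt() and getp_rr() and the sizes NIF and QLEN are assumptions
made for the example, and a real Xinu version would block on a semaphore
rather than return NULL when every queue is empty.

```c
#include <stddef.h>

#define NIF   3		/* assumed number of interfaces		*/
#define QLEN  8		/* assumed depth of each input queue	*/

struct ifq {			/* hypothetical per-interface queue	*/
	void *iq_pkt[QLEN];	/* circular buffer of packet pointers	*/
	int   iq_head;		/* index of the oldest packet		*/
	int   iq_count;		/* packets currently queued		*/
};

struct ifq ifqs[NIF];
int rr_last = NIF - 1;		/* interface served most recently	*/

/* Queue pkt for IP on interface ifn; return -1 if the queue is full. */
int enq_pkt(int ifn, void *pkt)
{
	struct ifq *q = &ifqs[ifn];

	if (q->iq_count == QLEN)
		return -1;
	q->iq_pkt[(q->iq_head + q->iq_count) % QLEN] = pkt;
	q->iq_count++;
	return 0;
}

/* Return the next packet, scanning interfaces round robin starting
 * just after the one served last; NULL if all queues are empty. */
void *getp_rr(int *ifn)
{
	int i;

	for (i = 1; i <= NIF; ++i) {
		int n = (rr_last + i) % NIF;
		struct ifq *q = &ifqs[n];

		if (q->iq_count > 0) {
			void *pkt = q->iq_pkt[q->iq_head];

			q->iq_head = (q->iq_head + 1) % QLEN;
			q->iq_count--;
			rr_last = n;
			*ifn = n;
			return pkt;
		}
	}
	return NULL;
}
```

Starting each scan after the interface served last keeps one busy interface
from starving the others.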
	Note that local packets are routed via IP just like outbound packets.
The netout() function for this "local interface" looks up the appropriate UDP
port or ICMP id and delivers the packet to the correct process. It also
handles, directly, the ICMP functions that don't interact with local processes.

A Routing Table
	We define routing table entries as follows:

struct route {
	IPaddr	rt_net;		/* net for this route	*/
	IPaddr	rt_mask;	/* mask for this net	*/
	IPaddr	rt_gw;		/* next IP hop		*/
	short	rt_metric;	/* distance metric	*/
	short	rt_intf;	/* interface number	*/
	short	rt_key;		/* sort key		*/
	short	rt_ttl;		/* time to live		*/
	struct	route *rt_next;	/* next for this hash	*/
/* stats */
	int	rt_refcnt;	/* current ref count	*/
	int	rt_usecnt;	/* use total		*/
};
	The routing table is globally locked during lookups and the reference
count prevents newly deleted routes from being freed before all references
disappear. Every route has a timer associated with it, with a special value
"RT_INF" to mean "don't expire this route" (though this latter wasn't needed,
since we used RIP and no static routes).
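
	The deferred free can be sketched as below. This is an illustration
only, not the project code: struct rtref, its rt_deleted flag and the three
function names are hypothetical (the struct above has no such flag), and a
real version would run under the routing table lock.

```c
#include <stdlib.h>

struct rtref {			/* hypothetical cut-down route entry	*/
	int rt_refcnt;		/* current ref count			*/
	int rt_deleted;		/* assumed flag: deleted, awaiting free	*/
};

/* Allocate a zeroed entry: no references, not deleted. */
struct rtref *rtref_new(void)
{
	return calloc(1, sizeof(struct rtref));
}

/* Take a reference, eg, while a packet is being sent via the route. */
void rtref_hold(struct rtref *r)
{
	r->rt_refcnt++;
}

/* Drop a reference; free the entry if it was deleted meanwhile and this
 * was the last reference. Returns 1 if the entry was freed. */
int rtref_release(struct rtref *r)
{
	if (--r->rt_refcnt == 0 && r->rt_deleted) {
		free(r);
		return 1;
	}
	return 0;
}

/* Delete the route: free it now if unreferenced, otherwise mark it so
 * that the last rtref_release() frees it. Returns 1 if freed now. */
int rtref_delete(struct rtref *r)
{
	r->rt_deleted = 1;
	if (r->rt_refcnt == 0) {
		free(r);
		return 1;
	}
	return 0;
}
```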
	The lookup mechanism is a hash based on the low order portions of
the IP network number (not including host or subnet). We order entries with
the same hash by the number of bits in the mask, so that matches occur on
most specific (ie, host) routes first and successively towards least specific
routes (standard IP net routes). Although subnet masks are not defined to
follow this ordering, multiple matches on subnets of that sort are ambiguous
anyway, and in practice the problem doesn't occur. These semantics also allow
easy implementation of a subnet hierarchy, rather than just mutually exclusive
subnet numbering.
	If the above scheme fails to find a matching route, the getrt()
function returns a distinguished default route (network 0.0.0.0), if one
was set.
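
	A lookup along these lines could be sketched as follows. It is a
simplified illustration under stated assumptions, not our getrt(): struct rt
is a cut-down stand-in for struct route with addresses held as 32-bit
integers in host byte order, and RT_TSIZE, netnum(), rthash(), maskbits(),
rt_insert() and rt_lookup() are names and choices made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define RT_TSIZE 16		/* assumed number of hash buckets (power of 2) */

struct rt {			/* simplified stand-in for struct route	*/
	uint32_t   rt_net;	/* net for this route			*/
	uint32_t   rt_mask;	/* mask for this net			*/
	struct rt *rt_next;	/* next for this hash			*/
};

struct rt *rttable[RT_TSIZE];
struct rt *rtdefault;		/* route for network 0.0.0.0, if set	*/

/* Classful network number: drops both host and subnet bits. */
uint32_t netnum(uint32_t ip)
{
	if ((ip & 0x80000000u) == 0)
		return ip & 0xff000000u;	/* class A */
	if ((ip & 0xc0000000u) == 0x80000000u)
		return ip & 0xffff0000u;	/* class B */
	return ip & 0xffffff00u;		/* class C (and the rest) */
}

/* Hash on the low-order bits of the folded network number. */
unsigned rthash(uint32_t ip)
{
	uint32_t n = netnum(ip);

	return (unsigned)((n >> 24) ^ (n >> 16) ^ (n >> 8)) & (RT_TSIZE - 1);
}

/* Count the one bits in a mask. */
int maskbits(uint32_t m)
{
	int c = 0;

	for (; m != 0; m >>= 1)
		c += (int)(m & 1);
	return c;
}

/* Insert r in its chain, keeping entries with more mask bits first. */
void rt_insert(struct rt *r)
{
	struct rt **pp = &rttable[rthash(r->rt_net)];

	while (*pp != NULL && maskbits((*pp)->rt_mask) >= maskbits(r->rt_mask))
		pp = &(*pp)->rt_next;
	r->rt_next = *pp;
	*pp = r;
}

/* First match in the chain is the most specific; else the default. */
struct rt *rt_lookup(uint32_t dst)
{
	struct rt *r;

	for (r = rttable[rthash(dst)]; r != NULL; r = r->rt_next)
		if ((dst & r->rt_mask) == r->rt_net)
			return r;
	return rtdefault;	/* NULL if no default route was set */
}
```

Since host, subnet and net routes for one network all hash on the same
classful network number, they share a chain, and the ordering by mask bits
alone decides which one matches.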
	One shortcoming of this implementation is that routes have a single
metric. A better solution would be to have (potentially) multiple
(protocol,metric) pairs as outlined in MIB (RFC 1066). Thus, eg, EGP and RIP
metrics would be distinct. As it is, we only implement RIP, so the conflict
does not occur (yet).

The IP Process
	The IP process is a simple loop that gets a packet off of one of the
interface queues (round robin via function "getp()") and routes the packet
to an appropriate interface based on the routing table. Some additional
complications arise from checks for identical IN/OUT interfaces (for
ICMP redirects). It is also here that we do directed broadcast interpretation,
so that the gateway both receives a directed broadcast itself (acting as a
host) and forwards it to the other hosts on the common network.
	Finally, the IP process computes new time to live and checksum fields
for all routed packets and stamps the source IP address for locally generated
packets.
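
	For readers who have not written it before, the time to live and
checksum update amounts to something like the sketch below. The function
names are invented for the example and this is not the project source; it
works on a raw header in network byte order, and dropping a packet whose
time to live runs out is an assumption about policy that the text above
does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum: one's complement of the one's-complement sum of
 * the data taken as 16-bit big-endian words. */
uint16_t ip_cksum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)				/* odd trailing byte */
		sum += (uint32_t)p[len - 1] << 8;
	while (sum >> 16)			/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Decrement the TTL (header byte 8) and recompute the header checksum
 * (bytes 10-11). Returns 0 if the TTL is exhausted and the packet
 * should not be forwarded, 1 otherwise. */
int ip_forward_fixup(uint8_t *hdr)
{
	size_t hlen = (size_t)(hdr[0] & 0x0f) * 4;	/* header length */
	uint16_t ck;

	if (hdr[8] <= 1)
		return 0;
	hdr[8]--;
	hdr[10] = 0;			/* checksum field is zero while summing */
	hdr[11] = 0;
	ck = ip_cksum(hdr, hlen);
	hdr[10] = (uint8_t)(ck >> 8);
	hdr[11] = (uint8_t)(ck & 0xff);
	return 1;
}
```

A header with a correct checksum sums to zero when run through ip_cksum()
with the checksum field left in place, which makes a handy self-check.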
	The function putp() actually queues the packet on the selected
interface's output queue for delivery.

IP Reassembly
	For IP reassembly queues, we use a set of generic priority queue
functions (also used in the routing table) to order the received fragments
by offset. We add fragment packet buffers to the queue of fragments with the
same IP address and IP id. When all are present, we allocate a buffer from
a special "large buffer pool" and copy the fragments in. After IP header
adjustments on the newly created datagram, we return the completed datagram
to the higher layers.
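
	The ordering and the "all present" test can be illustrated with a
plain linked list. This is a sketch under assumptions, not the generic
priority queue functions the project uses: struct frag, frag_insert() and
frag_complete() are hypothetical names, and offsets and lengths are in bytes.

```c
#include <stddef.h>

struct frag {			/* hypothetical fragment record		*/
	unsigned     f_off;	/* byte offset of the data in the datagram */
	unsigned     f_len;	/* bytes of data in this fragment	*/
	int          f_more;	/* "more fragments" flag from the header */
	struct frag *f_next;	/* next fragment, by ascending offset	*/
};

/* Insert f into the list at *head, keeping ascending offset order. */
void frag_insert(struct frag **head, struct frag *f)
{
	while (*head != NULL && (*head)->f_off <= f->f_off)
		head = &(*head)->f_next;
	f->f_next = *head;
	*head = f;
}

/* Return the total data length if the fragments cover the datagram from
 * offset 0 with no holes and the final one has "more fragments" clear;
 * return 0 while anything is still missing. */
unsigned frag_complete(const struct frag *head)
{
	const struct frag *last = NULL;
	unsigned next = 0;		/* first byte not yet covered */

	for (; head != NULL; head = head->f_next) {
		if (head->f_off > next)
			return 0;	/* hole before this fragment */
		if (head->f_off + head->f_len > next)
			next = head->f_off + head->f_len;
		last = head;
	}
	return (last != NULL && !last->f_more) ? next : 0;
}
```

The length returned by frag_complete() is what one would use to size the
buffer taken from the large buffer pool.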
	We set a timer for each fragment queue and reset it for each fragment
we add to the queue. If it expires, we free the fragments and generate the
ICMP fragment timed out message.
	If any errors occur on a partially completed reassembly queue, we
free all of the fragments on the queue and leave a "stub" queue that simply
collects other inbound fragments for this queue and discards them; eventually,
the timer expires and deletes the queue stub.
	Reassembly is part of the local interface's netout() process, since
only packets bound for the local host are candidates for reassembly.

IP Fragmentation
	Fragmentation is one of the functions of putp(). It computes the
fragment size based on the maximum transfer unit of the selected interface
and the "right shifted by three" semantics of the IP fragment offset header
field. It then breaks packets up as needed, duplicates and corrects the IP
header and queues the packet on the interface's output queue.
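
	The size computation works out as in the sketch below. It is an
illustration rather than the code in putp(): struct fragdesc and frag_plan()
are hypothetical names, it only plans the fragments instead of copying data,
and it assumes the datagram being split is not itself a fragment.

```c
/* hypothetical description of one fragment to be sent */
struct fragdesc {
	unsigned fd_off;	/* value for the header's fragment offset
				 * field, ie, byte offset >> 3		*/
	unsigned fd_len;	/* bytes of data in this fragment	*/
	int      fd_more;	/* "more fragments" flag		*/
};

/* Plan the fragments for datalen bytes of data sent with an hlen-byte
 * IP header over an interface with the given MTU. Returns the number
 * of fragments written to out[], or -1 if they will not fit in maxout
 * entries or the MTU cannot carry any data. */
int frag_plan(unsigned datalen, unsigned mtu, unsigned hlen,
	      struct fragdesc *out, int maxout)
{
	unsigned max, off = 0;
	int n = 0;

	if (mtu <= hlen)
		return -1;
	max = ((mtu - hlen) >> 3) << 3;	/* round down to a multiple of 8 */
	if (max == 0)
		return -1;
	while (off < datalen) {
		unsigned len = datalen - off;

		if (n == maxout)
			return -1;
		if (len > max)
			len = max;
		out[n].fd_off = off >> 3;
		out[n].fd_len = len;
		out[n].fd_more = (off + len < datalen);
		off += len;
		++n;
	}
	return n;
}
```

With a 1500-byte MTU and a 20-byte header each fragment but the last carries
1480 bytes; with a 576-byte MTU the 556 bytes available round down to 552.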

ICMP
	We implement virtually all of ICMP, with the exceptions listed below.
We do this as two functions: icmp_in() handles ICMP messages bound for the
local machine directly. This includes routing table changes (from redirects),
process demultiplexing and delivery (eg, for echo replies) and all of the
requests from other hosts (information, mask, etc).
	The icmp() function is how we generate ICMP messages for a remote
host or gateway. We use it throughout the network code for generating redirects,
error messages, mask requests, etc.
	The ICMP functions we do not support are ICMP SRC QUENCH, TIMESTAMP
and the SRC ROUTE error for DESTINATION UNREACHABLE.

Hosts and Gateways
	We organized the project into two separate directories for building
kernel images, both sharing the same sources. Where appropriate, we used
#ifdef or #ifndef on "GATEWAY" to distinguish hosts and gateways.
	The hosts configure only a simulated Othernet and act in isolation
otherwise. All routing information, mask information, time setting, host
name translations, etc come from the gateway or by packets routed through
it.
	All hosts run the same image. We select the assigned Internet network
number for the simulation at boot (read from the console, after a prompt),
and the IP host part and Othernet hardware numbers are derived from the
hardware addresses, so no special care is required to ensure they are all
unique.
	The gateway boots with some knowledge already, including the subnet
masks for the interfaces. Everything else it can, including routing
information, it acquires dynamically.

RIP
	We implement active and passive RIP, Split Horizon and correct
interpretation of RIP Infinity. We also translate subnet routes to net
routes when broadcasting on non-subnetted networks. We implement everything
that currently popular RIP implementations do, but we do not support full RFC 1058
RIP. In particular, we do not support Poison Reverse and Triggered Updates.
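
	Split Horizon and the handling of RIP Infinity reduce to two small
rules, sketched here. This is not the project source: struct riprt,
rip_metric_in() and rip_select() are hypothetical names, and the table is a
flat array purely for illustration.

```c
#define RIP_INFINITY 16		/* RIP metric meaning "unreachable" */

struct riprt {			/* hypothetical cut-down route entry	*/
	unsigned long rr_net;	/* destination network			*/
	int           rr_metric;	/* distance metric		*/
	int           rr_intf;	/* interface the route was learned on	*/
};

/* Metric to record for a route a neighbor advertised with metric adv:
 * one hop more, but never beyond RIP Infinity. */
int rip_metric_in(int adv)
{
	int m = adv + 1;

	return m > RIP_INFINITY ? RIP_INFINITY : m;
}

/* Copy into out[] the routes that may be advertised on interface intf.
 * Simple Split Horizon: routes learned on intf are left out entirely
 * (Poison Reverse would instead send them with metric RIP_INFINITY).
 * Returns the number of routes copied; out[] must hold n entries. */
int rip_select(const struct riprt *tab, int n, int intf, struct riprt *out)
{
	int i, k = 0;

	for (i = 0; i < n; ++i) {
		if (tab[i].rr_intf == intf)
			continue;
		out[k++] = tab[i];
	}
	return k;
}
```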

Unfinished Business
	We began an implementation of SGMP. The data structures and basic
variable get/set functions are in place, but there is no support yet for
network transactions.
	We also have hooks, but no more, for EGP.

In Closing
	Though ambitious in scale, this project exposed us to the problems,
often subtle, that arise in Internetwork engineering. By actually building a
working Internet gateway, we gained direct experience that would not be
possible from classroom lectures alone.

-- 
					+-DLS  (dls@mace.cc.purdue.edu)