[net.mail] What the heck is a domain?

mark@cbosgd.UUCP (01/29/84)
This is intended as a quick introduction to domains.  For more details,
you should read some of the documents referenced at the end.

So what is a domain?  Basically, it's a way of looking at the world as
a hierarchy (tree structure).  You're already used to using two tree
world models that work pretty well: the telephone system and the post
office.  Domains form a similar hierarchy for the electronic mail
community.

The post office divides the world up geographically, first into
countries, then each country divides itself up, those units subdivide,
and so on.  One such country, the USA, divides into states, which divide
into counties (except for certain states, like Louisiana, which divide
into things like parishes); the counties subdivide into cities, towns,
and townships, which typically divide into streets; the streets divide
into lots with addresses, possibly containing room and apartment
numbers, and then the individual people at that address.  So you have an
address like

	Mark Horton
	Room 2C-249
	6200 E. Broad St.
	Columbus, Ohio, USA

(I'm ignoring the name "AT&T Bell Laboratories" and the zip code, which
are redundant information.)  Other countries may subdivide differently;
for example, many small European countries do not have states.

The telephone system is similar.  Your full phone number might look
like 1-614-860-4276 x234.  This contains, from left to right, your
country code (bet you didn't know the USA had country code "1", did
you?), area code 614 (Central Ohio), 860 (a prefix in the Reynoldsburg
C.O.), 4276 (individual phone number), and extension 234.  Some phone
numbers do not have extensions, but the phone system in the USA has
standardized on a 3 digit area code, 3 digit prefix, and 4 digit phone
number.  Other countries don't use this standard; for example, in
Sweden a number might be +46 8 7826034 (country code 46, city code
8, number 7826034), in Germany +49 231 7554824, in the Netherlands +31
80 558883, in Britain +44 227 66822 or +44 506 411583.  Note that the
country and city codes and telephone numbers are not all the same
length, and the punctuation is different from our North American
notation.  Within a country, the length of the telephone number might
depend on the city code.  Even within the USA, the length of extensions
is not standardized: some places use the last 4 digits of the telephone
number for the extension, some use 2 or 3 or 4 digit extensions you
must ask an operator for.  Each country has established local
conventions.  But the numbers are unambiguous when dialed from
left-to-right, so as long as there is a way to indicate when you are
done dialing, there is no problem.
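
To make the left-to-right idea concrete, here is a rough sketch in
Python of how a switch might peel off the country code.  The table
holds only the country codes that appear in the examples above, and the
name split_country_code is made up for illustration; a real switch
knows far more codes than this.

```python
# Tiny illustrative subset of country codes (only those used in the text).
COUNTRY_CODES = {"1", "31", "44", "46", "49"}

def split_country_code(number):
    """Return (country code, remaining digits), reading left to right."""
    # Punctuation differs from country to country, so keep only the digits.
    digits = "".join(ch for ch in number if ch.isdigit())
    # No code in the table is a prefix of another, so the first match is
    # the only possible match.
    for n in range(1, len(digits) + 1):
        if digits[:n] in COUNTRY_CODES:
            return digits[:n], digits[n:]
    raise ValueError("no known country code at the start of %r" % number)

print(split_country_code("+49 231 7554824"))
print(split_country_code("1-614-860-4276"))
```

The remainder would be handed to the next level down, which applies its
own local conventions to it in the same way.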

A key difference in philosophy between the two systems is evident from
the way addresses and telephone numbers are written.  With an address,
the most specific information comes first, the least specific last.
(The "root of the tree" is at the right.)  With telephones, the least
specific information (root) is at the left.  The telephone system was
designed for machinery that looks at the first few digits, does
something with it, and passes the remainder through to the next level.
Thus, in effect, you are routing your call through the telephone
network.  Of course, the exact sequence you dial depends on where you
are dialing from - sometimes you must dial 9 or 8 first; to get an
international dialtone you must dial 011; and if you are calling locally
you can (and sometimes must) leave off the 1 and the area code.  (This
makes life very interesting for people who must design a box to call
their home office from any phone in the world.)  This type of address
is called a "relative address", since the actual address used depends
on the location of the sender.

The postal system, on the other hand, allows you to write the same
address no matter where the sender is.  The address above will get to
me from anywhere in the world, even private company mail systems.  Yet,
some optional abbreviations are possible - I can leave off the USA if
I'm mailing within the USA; if I'm in the same city as the address, I
can usually just say "city" in place of the last line.  This type of
address is called an "absolute address", since the unabbreviated form
does not depend on the location of the sender.

The ARPANET has evolved with a system of absolute addresses:
"user@host" works from any machine.  The UUCP network has evolved with
a system of relative addresses: "host!user" works from any machine with
a direct link to "host", and you have to route your mail through the
network to find such a machine.  In fact, the "user@host" syntax has
become so popular that many sites run mail software that accepts this
syntax, looks up "host" in a table, and sends the mail to the appropriate
network for "host".  This is a very nice user interface, but it only
works well in a small network.  Once the set of allowed hosts grows
past about 1000 hosts, you run into all sorts of administrative
problems.
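
As a rough sketch (not anybody's actual mailer), the table lookup
described above might look like this in Python.  The table contents and
the name network_for are made up for illustration.

```python
# Hypothetical host table: flat host name -> network that reaches it.
HOST_TABLE = {
    "berkeley": "arpanet",
    "ihnp4": "uucp",
    "cbosgd": "uucp",
}

def network_for(address):
    """Return (user, host, network) for a flat "user@host" address.

    network is None when the host is not in the table, which is exactly
    the case that becomes common once the table cannot be kept current.
    """
    user, _, host = address.rpartition("@")
    if not user or not host:
        raise ValueError("expected user@host, got %r" % address)
    return user, host, HOST_TABLE.get(host.lower())

print(network_for("mark@cbosgd"))
print(network_for("sam@newhost"))
```

Everything hinges on one flat table, so every new host anywhere in the
world means another entry that somebody has to hear about and type in.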

One problem is that it becomes nearly impossible to keep a table of
host names up to date.  New machines are being added somewhere in the
world every day, and nobody tells you about them.  When you try to send
mail to a host that isn't in your table (replying to mail you just got
from a new host), your mailing software might try to route it to a
smarter machine, but without knowing which network to send it to, it
can't guess which smarter machine to forward to.  Another problem is
name space collision - there is nothing to prevent a host on one
network from choosing the same name as a host on another network.  For
example, DEC's ENET has a "vortex" machine, and there is also one on UUCP.
Both had their names long before the two networks could talk to each
other, and neither had to ask the other network for permission to use
the name.  The problem is compounded when you consider how many
computer centers name their machines "A", "B", "C", and so on.

In recognition of this problem, DOD has established a new way to name
computers based on domains.  The ARPANET is pioneering the domain
convention, and many other computer networks are falling in line, since
it is the first naming convention that looks like it really stands a
chance of working.  The MILNET portion of ARPANET has a domain, CSNET
has one, and it appears that Digital, AT&T, and UUCP will be using
domains as well.  Domains look a lot like postal addresses, with a
simple syntax that fits on one line, is easy to type, and is easy for
computers to handle.  To illustrate, an old routed UUCP address might
read "sdcsvax!ucbvax!allegra!cbosgd!mark".  The domain version of this
might read "mark@d.osg.cb.att.uucp".  The machine is named
d.osg.cb.att.uucp (UUCP domain, AT&T company, Columbus site, Operating
System Group project, fourth machine).  Of course, this example is
somewhat verbose and contrived; it illustrates the hierarchy well, but
most people would rather type something like "cbosgd.att.uucp" or even
"cbosgd.uucp", and it is quite possible to set up domains so that you
don't have to type very much.
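
To see the tree in the name, here is a small Python sketch that turns a
machine name into its labels, root first.  The function name
labels_root_first is an assumption of mine, not part of any standard.

```python
def labels_root_first(machine_name):
    """Split a dotted machine name into labels, least specific first."""
    labels = machine_name.lower().split(".")
    if not all(labels):
        raise ValueError("empty label in %r" % machine_name)
    return labels[::-1]

# Print the example name as an indented tree, root at the top.
for depth, label in enumerate(labels_root_first("d.osg.cb.att.uucp")):
    print("  " * depth + label)
```

Written this way the name reads like the postal address earlier: the
root is on the right of the written form and comes out first here.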

You may wonder why the single @ sign is present, that is, why the above
address does not read "mark.d.osg.cb.att.uucp".  In fact, it was
originally proposed in this form, and some of the examples in RFC819 do
not contain an @ sign.  The @ sign is present because some ARPANET
sites felt a strong need for a divider between the domain, which
names one or more computers, and the left hand side, which is subject
to whatever interpretation the domain chooses.  For example, if the ATT
domain chooses to address people by full name rather than by their
login, an address like "Mark.Horton@ATT.UUCP" makes it clear that some
machine in the ATT domain should interpret the string "Mark.Horton",
but if the address were "Mark.Horton.ATT.UUCP", routing software might
try to find a machine named "Horton" or "Mark.Horton".  (By the way,
case is ignored in domains, so that "ATT.UUCP" is the same as
"att.uucp".  To the left of the @ sign, however, a domain can interpret
the text any way it wants; case can be ignored or it can be
significant.)
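
A minimal Python sketch of what the @ buys you, assuming an address
with exactly one @ sign.  The names split_address and same_domain are
made up for illustration.

```python
def split_address(address):
    """Split at the single @ into (local part, domain).

    The local part is returned untouched, since only the destination
    domain knows how to interpret it.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain or "@" in domain:
        raise ValueError("expected exactly one @ in %r" % address)
    return local, domain

def same_domain(a, b):
    """Domains compare without regard to case."""
    return a.lower() == b.lower()

local, domain = split_address("Mark.Horton@ATT.UUCP")
print(local, domain, same_domain(domain, "att.uucp"))
```

Without the @ there is nothing in "Mark.Horton.ATT.UUCP" to tell the
code where the domain stops, so split_address refuses it.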

It is important to note that DOMAINS ARE NOT ROUTES.  Some people look
at the number of !'s in the first example and the number of .'s in the
second, and assume the latter is being routed from a machine called
"uucp" to another called "att" to another called "cb" and so on.  It is
possible to set up mail routing software to do this, and indeed in the
worst case, even without a reasonable set of tables, this method will
always work.  But the intent is that "d.osg.cb.att.uucp" is the name of
a machine, not a path to get there.  In particular, domains are
absolute addresses, while routes depend on the location of the sender.
Some subroutine is charged with figuring out, given a domain based
machine name, what to do with it.  In a classy environment like the
ARPA Internet, it can query a table or a name server, come up with a 32
bit host number, and connect you directly to that machine.  In the UUCP
environment, we don't have the concept of two processes on arbitrary
machines talking directly, so we forward mail one hop at a time until
it gets to the appropriate destination.  In this case, the subroutine
decides if the name represents the local machine, and if not, decides
which of its neighbors to forward the message to.
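
Here is one way the UUCP-style subroutine could be sketched in Python.
This is only an illustration: the local name, the forwarding table, and
the choice of "most specific matching suffix wins" are all assumptions
of mine, not a description of any existing mailer.

```python
LOCAL_NAME = "d.osg.cb.att.uucp"   # hypothetical name of this machine

# Hypothetical forwarding table: domain -> neighbor to hand the mail to.
FORWARD = {
    "osg.cb.att.uucp": "cbosg",
    "att.uucp": "ihnp4",
    "arpa": "ucbvax",
}

def next_hop(machine):
    """Return None for local delivery, else the neighbor to forward to."""
    name = machine.lower()
    if name == LOCAL_NAME:
        return None
    labels = name.split(".")
    # Try the whole name first, then drop labels from the left so that
    # the most specific domain in the table wins.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in FORWARD:
            return FORWARD[suffix]
    raise LookupError("no forwarding entry covers %r" % machine)

print(next_hop("e.osg.cb.att.uucp"))
print(next_hop("NIC.ARPA"))
```

Note that the name itself never changes along the way; each machine
that handles the message makes the same kind of decision with its own
table.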

So, after all this background, we still haven't said what a domain is.
The answer (I hope it's been worth the wait) is that a domain is a
subtree of the world tree.  For example, "uucp" is a top level domain
(that is, a subtree of the "root") and represents all names and
machines beneath it in the tree.  "att.uucp" (or just "att" if there
isn't some other "att" to confuse it with) is a subdomain of "uucp",
representing all names, machines, and subdomains beneath "att" in the
tree.  Similarly for "cb", "osg", and even "d" (although "d" is a
"leaf" domain, representing only the one machine).

A domain has certain properties.  The key property is that it has a
"registry".  That is, the domain has a list of the names of all
immediate subdomains, plus information about how to get to each one.
There is also a contact person for the domain.  This person is
responsible for the domain, keeping the registry up-to-date, serving as
a point of contact for outside queries, and setting policy requirements
for subdomains.  Each domain can decide who it will allow to have
subdomains, and establish requirements that all subdomains must meet to
be included in the registry.  For example, the "cb" domain might
require all subdomains to be physically located in the AT&T building in
Columbus.
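
The registry idea, and the notion of a domain as a subtree, can be
sketched as a small Python class.  Everything here (the class, its
field names, the contact strings, the "how to reach it" strings, and
the Columbus policy check) is made up for illustration.

```python
class Domain:
    """One node of the name tree, with a registry of its subdomains."""

    def __init__(self, label, contact, location=None, accepts=None):
        self.label = label.lower()   # case is ignored in domain names
        self.contact = contact       # person responsible for the domain
        self.location = location     # used only by the example policy
        self.accepts = accepts or (lambda child: True)
        self.registry = {}           # label -> (subdomain, how to reach it)

    def register(self, child, how_to_reach):
        """Add an immediate subdomain if it meets this domain's policy."""
        if not self.accepts(child):
            raise ValueError("%s does not meet the policy of %s"
                             % (child.label, self.label))
        if child.label in self.registry:
            raise ValueError("%s is already registered" % child.label)
        self.registry[child.label] = (child, how_to_reach)

    def names_beneath(self, suffix=""):
        """Yield the name of this domain and of everything under it."""
        full = self.label + suffix
        yield full
        for child, _how in self.registry.values():
            yield from child.names_beneath("." + full)

uucp = Domain("uucp", "uucp-contact")
att = Domain("att", "att-contact")
cb = Domain("cb", "cb-contact",
            accepts=lambda child: child.location == "Columbus")
osg = Domain("osg", "osg-contact", location="Columbus")

uucp.register(att, "via ihnp4")
att.register(cb, "via cbosgd")
cb.register(osg, "local link")
print(list(uucp.names_beneath()))
```

Each domain only knows its immediate children; the whole subtree falls
out of walking the registries, which is why no single site has to keep
a complete list of hosts.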

ARPA has established certain requirements for top level domains.  These
requirements specify that there must be a list of all subdomains and
contact persons for them, a responsible person who is an authority for
the domain (so that if some site does something bad, it can be made to
stop), a minimum size (to prevent small domains from being top level),
and a pair of nameservers (for redundancy) to provide a
directory-assistance facility.  Domains can be more lax about the
requirements they place on their subdomains, making it harder to be a
top level domain than somewhere lower in the tree.  Of course, if you
are a subdomain, your parent is responsible for you.

One requirement that is NOT present is for unique parents.  That is, a
machine (or an entire subdomain) need not appear in only one place in
the tree.  Thus, "cb" might appear both in the "att" domain, and in the
"ohio" domain.  This allows domains to be structured more flexibly than
just the simple geography used by the postal service and the telephone
company; organizations or topography can be used in parallel as well.
(Actually, there are a few instances where this is done in the postal
service [overseas military mail] and the telephone system [prefixes can
appear in more than one area code, e.g. near Washington D.C., and
Silicon Valley].)  It also allows domains to split or join up, while
remaining upward compatible with their old addresses.

Do all domains represent specific machines?  Not necessarily.  It's
pretty obvious that a full path like "d.CBOSG.ATT.UUCP" refers to
exactly one machine.  The OSG domain might decide that "CBOSG.ATT.UUCP"
represents a particular gateway machine.  Or it might decide that it
represents a set of machines, several of which might be gateways.  The
"ATT.UUCP" domain might decide that several machines, "ihnp4.UUCP",
"whgwj.UUCP", and "hogtw.UUCP", are all entry points into "ATT.UUCP".  Or
it might decide that it just represents a spot in the name space, not a
machine.  For example, there is no machine corresponding to "ARPA" or
"UUCP", or to the root.  Each domain decides for itself.  The naming
space and the algorithm for getting mail from one machine to another
are not closely linked - routing is up to the mail system to figure
out, with or without help from the structure of the names.

The domain syntax does allow explicit routes, in case you want to
exercise a particular route or some gateway is balking.  The syntax is
"@dom1,@dom2,...,@domn:user@domain", for example,
@ihnp4.UUCP,@ucbvax.UUCP:joe@NIC.ARPA, forcing it to be routed through
dom1, dom2, ..., domn, and from domn sent to the final address.  This
behaves exactly like the UUCP ! routing syntax, although it is somewhat
more verbose.
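
A small Python sketch of pulling such an address apart, assuming the
"@dom1,@dom2,...,@domn:user@domain" form given above.  The function
name parse_route is mine, and this is not a full RFC822 parser.

```python
def parse_route(address):
    """Split an explicit route into (list of hops, final address).

    An address with no leading @ has no route and comes back unchanged.
    """
    if not address.startswith("@"):
        return [], address
    route, sep, final = address.partition(":")
    if not sep or not final:
        raise ValueError("route without a final address: %r" % address)
    hops = []
    for hop in route.split(","):
        if not hop.startswith("@") or len(hop) == 1:
            raise ValueError("bad hop %r in %r" % (hop, address))
        hops.append(hop[1:])
    return hops, final

hops, final = parse_route("@ihnp4.UUCP,@ucbvax.UUCP:joe@NIC.ARPA")
print(hops, final)
```

The hops are visited in the order written, left to right, which is what
makes this behave like a chain of !'s.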

By the way, you've no doubt noticed that some forms of electronic
addresses read from left-to-right (cbosgd!mark), others read from
right-to-left (mark@Berkeley).  Which is better?  The real answer here
is that it's a religious issue, and it doesn't make much difference.
Left-to-right is probably a bit easier for a computer to deal with
because it can understand something on the left and ignore the
remainder of the address.  (While it's almost as easy for the program
to read from right-to-left, the ease of going from left-to-right was
probably in the backs of the minds of the designers who invented
host:user and host!user.)  On the other hand, I claim that user@host is
easier for humans to read, since people tend to start reading from the
left and quit as soon as they recognize the login name of the person.
Also, a mail program that prints a table of headers may have to
truncate the sender's address to make it fit in a fixed number of
columns, and it's probably more useful to read "mark@d.osg.a" than
"ucbvax!sdcsv".  These are pretty minor issues; after all, humans can
adapt to skip to the end of an address, and programs can truncate on
the left.  But the real problem is that if the world contains BOTH
left-to-right and right-to-left syntax, you have ambiguous addresses
like x!y@z to consider.  This single problem turns out to be a killer,
and is the best single reason to try to stamp out one in favor of the
other.

		So why are we doing this, anyway?

The current world is full of lots of interesting kinds of mail syntax.
The old ARPA "user@host" is still used on the ARPANET by many systems.
Explicit routing can sometimes be done with an address like
"user@host2@host1" which sends the mail to host1 and lets host1
interpret "user@host2".  Addresses with more than one @ were made
illegal a few years ago, but many ARPANET hosts depended on them, and
the syntax is still being used.  UUCP uses "h1!h2!h3!user", requiring
the user to route the mail.  Berknets use "host:user" and do not allow
explicit routing.

To get mail from a host on one network to a host on another, it had to
be routed through gateways.  Thus, the address "csvax:mark@Berkeley"
from the ARPANET
would send the mail to Berkeley, which would forward it to the Berknet
address csvax:mark.  To send mail to the ARPANET from UUCP, an address
such as "ihnp4!ucbvax!sam@foo-unix" would route it through ihnp4 to
ucbvax, which would interpret "sam@foo-unix" as an ARPANET address and
pass it along.  When the Berknet-UUCP gateway and Berknet-ARPANET
gateway were on different machines, addresses such as
"csvax:ihnp4!ihnss!warren@Berkeley" were common.

As you can see, the combination of left-to-right UUCP syntax and
right-to-left ARPANET syntax makes things pretty complex.  Berknets are
gone now, but there are lots of gateways between UUCP and the ARPANET
and ARPANET-like mail networks.  Sending mail to an address for which
you only know a path from the ARPANET onto UUCP is even harder -
suppose the address you have is ihnp4!ihnss!warren@Berkeley, and you
are on host rlgvax which uses seismo as an ARPANET gateway.  You must
send to seismo!ihnp4!ihnss!warren@Berkeley, which is not only pretty
hard to read, but when the recipient tries to reply, it will have no
idea where the break in the address between the two UUCP pieces
occurs.  An ARPANET site routing across the UUCP world to somebody's
Ethernet using domains locally will have to send an address something
like "xxx@Berkeley.ARPA" to get it to UUCP, then
"ihnp4!decvax!island!yyy" to get it to the other Ethernet, then
"sam@csvax.ISLAND" to get it across their Ethernet.  The single address
would therefore be ihnp4!decvax!island!sam@csvax.ISLAND@Berkeley.ARPA,
which is too much to ask any person or mailer to understand.  It's even
worse: gateways have to deal with ambiguous names like
ihnp4!mark@Berkeley, which can be parsed either "(ihnp4!mark)@Berkeley"
in accordance with the ARPANET conventions, or "ihnp4!(mark@Berkeley)"
as the old UUCP would.
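
The two readings are easy to show side by side in Python.  This is
just a sketch of the two conventions as described above; the function
names are made up.

```python
def first_hop_arpanet(address):
    """ARPANET reading: the host after the last @ gets the rest."""
    rest, sep, host = address.rpartition("@")
    if not sep:
        raise ValueError("no @ in %r" % address)
    return host, rest

def first_hop_uucp(address):
    """Old UUCP reading: the host before the first ! gets the rest."""
    host, sep, rest = address.partition("!")
    if not sep:
        raise ValueError("no ! in %r" % address)
    return host, rest

address = "ihnp4!mark@Berkeley"
print(first_hop_arpanet(address))   # send to Berkeley first
print(first_hop_uucp(address))      # send to ihnp4 first
```

Both functions are reasonable on their own; the trouble is that a
gateway has no way to tell from the address which one the sender had in
mind.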

Another very important reason for using domains is that your mailing
address becomes absolute instead of relative.  It becomes possible to
put your electronic address on your business card or in your signature
file without worrying about writing six different forms and fifteen
hosts that know how to get to yours.  It drastically simplifies the job
of the reply command in your mail program, and automatic reply code in
the netnews software.

For further information, some of the basic ARPANET reference documents
are in order.  These can often be found posted to Usenet, or available
nearby.  They are all available on the ARPANET on host NIC via FTP with
login ANONYMOUS, if you have an ARPANET login.  They can also be
ordered from the SRI Network Information Center in Menlo Park, CA.

RFC819  The Domain Naming Convention for Internet User Applications
RFC821  Simple Mail Transfer Protocol
RFC822  Standard for the Format of ARPA Internet Text Messages
RFC881  The Domain Names Plan and Schedule