[net.unix-wizards] UNIX IPC Datagram Reliability under

berry@zehntel.UUCP (01/17/84)

#R:allegra:-220500:zinfandel:12400044:000:283
zinfandel!berry    Jan 16 12:47:00 1984

4.2 UNIX domain datagram service is 'unreliable' in the sense that
datagrams are not GUARANTEED to get to their recipient.  Overall, they are
in fact quite reliable until your Ethernet gets really loaded.
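
For anyone who hasn't looked at the calls, here is a minimal sketch of
sending one such datagram (the socket path, names, and error handling are
only illustrative):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <stdio.h>
    #include <string.h>

    int
    main()
    {
        int s;
        struct sockaddr_un to;
        char msg[] = "hello";

        /* an AF_UNIX datagram socket -- the 'unreliable' flavor */
        if ((s = socket(AF_UNIX, SOCK_DGRAM, 0)) < 0) {
            perror("socket");
            return 1;
        }

        memset(&to, 0, sizeof(to));
        to.sun_family = AF_UNIX;
        strcpy(to.sun_path, "/tmp/server.sock");    /* hypothetical server */

        /* a successful sendto() does NOT guarantee the receiver ever
         * sees the datagram; no error comes back if it is dropped */
        if (sendto(s, msg, sizeof(msg), 0,
                   (struct sockaddr *)&to, sizeof(to)) < 0)
            perror("sendto");
        return 0;
    }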

Berry Kercheval		Zehntel Inc.	(ihnp4!zehntel!zinfandel!berry)
(415)932-6900

andree@uokvax.UUCP (01/22/84)

#R:allegra:-220500:uokvax:6200010:000:374
uokvax!andree    Jan 20 21:38:00 1984

Gee, that's interesting. Non-guaranteed datagram service in Unix would fit
well with the Unix philosophy ("all the rope you need, and you can hang
yourself if you want to"), and has been done before (the Amoeba system).

I *hope* they did it that way in the kernel. I also hope they provided a
library that DOES DO reliable transmission - in multiple flavors.

	<mike

rpw3@fortune.UUCP (01/31/84)

#R:allegra:-220500:fortune:11600049:000:4908
fortune!rpw3    Jan 31 02:26:00 1984

[This lengthy tutorial probably belongs in net.arch, but the discussion
has been here so far.]

O.k., nobody has come forth to defend "UNIX domain datagrams", so here it is...

	>>> Why datagrams SHOULD be "unreliable". <<<

The internet datagram "style" is based on the observation that
the end processes in any communication have to be ultimately
responsible for "transaction integrity" so they might as well be
responsible for all of it. No amount of intermediate error checking and
retransmission can GUARANTEE reliable synchronization if the ultimate
producer and consumer do not do the handshake. The layers on layers
of protocols don't hack it, if the critical state is outside the end
process. Nodes can crash; links can crash; nodes and links can go down
and up. Servers (e.g. mail) still have to do their own ultimate lost
message and duplication checking.  (I will not argue that point
further. If you disagree, go see your local communications wizard and
get him/her to explain.) (Also, a moment of silence for anyone who
thinks X.25 is a "reliable" protocol.)

Given that the responsibility for ultimate error correction lies in the
end-point processes, the transmission and switching portion of the net
can get A LOT cheaper and simpler. Instead of trying (vainly) to GUARANTEE
that no data is lost (with the attendant headaches of very careful buffer
management, flow-control, load shedding, load-balancing, re-routing,
synchronizing, etc.), in the internet datagram style (DoD IP, Xerox NS, etc.)
the transmission system makes a "good effort" to get your packet from
here to there. The only thing that IS demanded is that the probability
of receiving a bad (damaged) packet that is claimed to be good should
be VERY small. (Since that is a one-way requirement, it's fairly easy.)
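
To be concrete about that one requirement: it is normally met with nothing
more than a checksum over the packet. A sketch of the 16-bit one's-complement
sum the IP family uses (the function name is mine; a real version also
worries about byte order):

    #include <stddef.h>

    /* One's-complement sum of the packet, 16 bits at a time.
     * The receiver recomputes it; a mismatch means the packet is
     * damaged and is simply thrown away.  Random damage slips
     * through as "good" only about once in 2^16 packets. */
    unsigned short
    cksum(const unsigned char *buf, size_t len)
    {
        unsigned long sum = 0;

        while (len > 1) {
            sum += (buf[0] << 8) | buf[1];
            buf += 2;
            len -= 2;
        }
        if (len == 1)                   /* pad a trailing odd byte */
            sum += buf[0] << 8;
        while (sum >> 16)               /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (unsigned short)~sum;    /* complement of the sum */
    }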

So if the packet has a bit error, throw it away; if the outgoing queue
won't hold the packet, throw it away (that line's overloaded anyway);
if the route's not valid anymore, toss it. Somebody (the end process)
will try again soon anyway. (Two notes: 1. It is considered polite
BUT NOT NECESSARY to send an error packet back, if you know where "back"
is; and 2. if the system is to be generally considered usable, the
long-term error rate should be less than 1%, although short-term losses
of 10% or more don't hurt anything.)
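
The "try again" part really is just a few lines at the end points. A
stop-and-wait sketch (the timeout, helper name, and request format are
invented for illustration; a real client also tags requests so a late reply
from an earlier try isn't mistaken for the current one):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/select.h>
    #include <sys/time.h>

    /* Send a request on datagram socket s and wait for a reply,
     * retransmitting on timeout.  Returns the reply length, or -1
     * once `tries' attempts have all gone unanswered. */
    int
    send_with_retry(int s, const struct sockaddr *to, int tolen,
                    const char *req, int reqlen,
                    char *reply, int replymax, int tries)
    {
        while (tries-- > 0) {
            fd_set rd;
            struct timeval tv;

            /* fire (or re-fire) the request; losses are expected */
            (void) sendto(s, req, reqlen, 0, to, tolen);

            FD_ZERO(&rd);
            FD_SET(s, &rd);
            tv.tv_sec = 2;          /* arbitrary 2-second timeout */
            tv.tv_usec = 0;

            if (select(s + 1, &rd, (fd_set *)0, (fd_set *)0, &tv) > 0)
                return recv(s, reply, replymax, 0);
            /* timed out: loop around and retransmit */
        }
        return -1;                  /* give up; the caller reports it */
    }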

This seemingly cavalier attitude results in ENORMOUS savings in complexity,
memory, and CPU ticks for the intermediate nodes, which merely make a
(good but not perfect) attempt to throw the packet out the next link.
Packet switching rates of several hundred to several thousand per second
are easily attainable with cheap micros. The routers don't have to have
any "memory" (other than the routing tables). They are not responsible
for "connections", or "re-transmissions", or "timeouts". They don't know
a terminal from a file (since they don't know either!).

Secondly, the CPU/memory load of handling the connections/retransmissions/etc.
is spread out where there are lots of resources -- at the end points. The
backbone nodes just move data, so they can move lots of it. (Think of a
hundred IBM PC's using your VAX to move files back and forth. Who do you
want to do the busy work, the VAX or the PC's?)

Thirdly, the end process always had to do about 70-90% of the work anyway,
duplicating the work the network was doing (and sometimes triplicating the
work that the kernel was duplicating, on top of that); the added 10-30%
is easily justified by the savings in the net (or in the kernel, if we are
talking about process-to-process on a single host -- I didn't forget).
The total number of CPU ticks on an end-point processor can even go DOWN,
because of the smaller number of encapsulations (layers) packets have to go
through. (In the simplest case, there are only three layers: client, datagram
router or internet, and physical.)

Lastly, there are some applications (voice, time-of-day) where you do not
want the network trying to "help" you. A voice packet that is perfect but
is late because it got retransmitted might as well have been lost -- it's
useless. Ditto time-of-day.

(whew! is it soup yet?)
 
So "unreliable" when talking about datagrams means "not perfect",
and is a desirable attribute. Desirable, since the cost of "reliability"
is very high and the goal illusory in any case. On a single processor,
it makes sense sometimes to have other (reliable) inter-process primitives
besides datagrams, if (1) throughput is paramount and (2) the set of
cooperating processes will NEVER be distributed. But the overhead of
handling the "retransmission" can be made small (and processes DO die
sometimes, even on uni-processors), so the argument for "reliable" IPC
is weaker than most people think.

Rob Warnock

UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphins Drive, Redwood City, CA 94065

ka@hou3c.UUCP (Kenneth Almquist) (02/04/84)

In response to Rob Warnock's defense of unreliable datagrams:

I have nothing against networks (like the ARPANET) that are based upon
unreliable datagrams, but I don't like the idea of lots of user
programs using unreliable datagrams.  They force each user to provide
for retransmission of lost packets, so there will be a lot of
reinventing the wheel.  And since testing the effectiveness of error
recovery schemes is rather difficult, probably many of these
retransmission schemes will be tested inadequately or not at all.  The
result is likely to be programs that work most of the time but suffer
occasional mysterious failures.

Ideally I would like an interprocess communication mechanism that would
handle all errors for the user.  It is not possible for any
communications protocol to recover if either the data network or the
program being communicated with dies, so the application program must be
informed of these errors and must be programmed to deal with them; but
I would require the protocol to recover from all other types of
errors.  Three possible objections to this:

	1)  It is unimplementable.
No.  Stream sockets meet these requirements.  (A minimal sketch of a
reliable local stream appears after this list.)

	2)  It is too inefficient to implement.
I believe that some of the most heavily used protocols on the
Arpanet, such as smtp (mail) and ftp (file transfer) are built on top
of tcp rather than being built directly on ip (unreliable datagrams),
so presumably the Arpanet developers find the cost of doing so
acceptable.  When the two processes are on one machine it is easy to
make communication reliable, and it's also more efficient to do
so than to have programs constantly setting alarm signals to determine
whether the programs they are talking to have gone away.  The cost of
setting up a connection probably makes stream sockets unacceptable for
some applications, but that is a problem with streams rather than with
the concept of reliable communications.  The V operating system, for
example, provides efficient interprocess communication without
requiring a "connection" to be established first.

	3)  Retransmission is not good for real time applications such
	    as voice or time of day.

True, but UNIX is not good for real time applications either.  Nor are
unreliable datagrams necessarily good for them; they may arrive out of
order after arbitrary delays.
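
To make (1) concrete, here is roughly all it takes to get a reliable,
ordered byte stream between two local processes (a sketch only; error
handling trimmed and the strings invented):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main()
    {
        int sv[2];
        char buf[64];
        int n;

        /* a connected pair of UNIX domain stream sockets:
         * reliable, ordered, and flow controlled by the kernel */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }

        if (fork() == 0) {              /* child plays "server" */
            close(sv[0]);
            write(sv[1], "reply", 6);
            _exit(0);
        }
        close(sv[1]);                   /* parent keeps only its own end */

        /* nothing is silently lost: data arrives intact and in order,
         * and a return of 0 from read() says the peer has gone away */
        n = read(sv[0], buf, sizeof(buf));
        if (n > 0)
            printf("got: %s\n", buf);
        else
            printf("peer gone or error\n");
        return 0;
    }
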
					Kenneth Almquist


P. S.  While I'm not an expert on X.25, I've never heard anyone call it
unreliable before.  It doesn't have dynamic routing on a per-message
basis, so if a link goes down anybody who has virtual circuits over the
link loses them, if that's what you mean.  However, a protocol does
not need end-to-end acknowledgements to avoid losing data.  The level 2
protocols on each link ensure that no data is lost.  If a node crashes
then of course any data in its memory is lost, but that is irrelevant
since the virtual circuit has been lost anyway.

rpw3@fortune.UUCP (02/07/84)

#R:allegra:-220500:fortune:11600052:000:2831
fortune!rpw3    Feb  6 20:41:00 1984

In response to Ken Almquist:

1. As Eric said, if you want reliable UNIX domain communication, use
   streams, which provide that service. IP datagrams are by definition
   unreliable. [But... if you are using UNIX domain as IPC, see below.]

2. In general, what makes good IPC within a single (tightly-coupled
   multi-)processor does not make a good (loosely-coupled network)
   message system, and vice-versa.  The tradeoffs are too different.
   Trying to use one where the other fits is awkward at best. (Microsecond
   busy-waits are often appropriate in the former case; sleeps of seconds
   or minutes in the latter.)

   A good fast low-overhead IPC can sometimes be one of the mechanisms
   on which networking is built (see the CMU "Hydra/C.mmp" book), but
   even so one must be careful about synchronization and queueing. 
   (It is not yet clear to me that S-5 IPC is fast/cheap enough.)
   Simple/clean IPC is rarely achieved on top of networking (but
   see the literature on "Remote Procedure Calls".)

   When the reason for having multiple processes in the first
   place was to "solve" some synchronization/event-waiting problem
   with multi-programming WITHIN a single process, the lack of adequate
   sync/event mechanisms often comes back to bite the IPC user.
   Whether it's "software interrupts on events" or the 4.2 "select",
   SOME form of "tell-me-about-the-first-interesting-event" seems
   necessary for real concurrency. Having a forest of processes,
   each waiting on one specific event, is useful only if the process
   sleep/wake time is VERY small, and efficient fine-grained locks
   on shared-memory data are available. (Again, see the "Hydra" book,
   conversely also see Holt's "Concurrent-Euclid/UNIX/Tunis".)

   I guess I'm trying to say, "Look again at why you're using UNIX
   domains at all." Preparation for future networking? [O.k., fine]
   ... or as a type of IPC? If IPC, would you rather have some other kind?
   [Nothing wrong with "making do", but one needn't celebrate the crutch.]

3. A close reading of the X.25 standard will reveal that a "RESET" message
   is permitted from the network or either party at any time, with no
   synchronization with the other party. (Remember, X.25 is a network
   ACCESS protocol, not a peer-peer protocol.) "RESET" does NOT drop
   the connection, it just resets ("drops") the sequence numbers. This
   can cause data to be lost and/or duplicated unless a higher level
   stream protocol (with its own sequence numbers) is used on top of
   X.25 connections.  Networks may issue "RESET" at any time, such as
   when load-shedding data to relieve buffer congestion.
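
To put some flesh on the "first interesting event" remark back in point 2,
here is a skeletal select() loop (the descriptor array and handle_input()
are hypothetical, supplied by whoever uses it):

    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/time.h>

    /* One process, many event sources: block until the first
     * descriptor has something for us, then dispatch.  `fds' is an
     * array of nfds open descriptors; handle_input() is a hypothetical
     * per-descriptor handler supplied by the caller. */
    void
    event_loop(int *fds, int nfds, void (*handle_input)(int))
    {
        for (;;) {
            fd_set rd;
            int i, maxfd = -1;

            FD_ZERO(&rd);
            for (i = 0; i < nfds; i++) {
                FD_SET(fds[i], &rd);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }

            /* sleep until SOMETHING interesting happens */
            if (select(maxfd + 1, &rd, (fd_set *)0, (fd_set *)0,
                       (struct timeval *)0) <= 0)
                continue;

            for (i = 0; i < nfds; i++)
                if (FD_ISSET(fds[i], &rd))
                    handle_input(fds[i]);
        }
    }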

Rob Warnock

UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphins Drive, Redwood City, CA 94065

rpw3@fortune.UUCP (02/07/84)

#R:allegra:-220500:fortune:11600053:000:1040
fortune!rpw3    Feb  7 00:45:00 1984

+--------------------
| From:  Peter Vanderbilt <pv.usc-cse@rand-relay>
| 
| It would be convenient to be able to reliably send one message
| without going through the trouble (and overhead) of setting up
| a stream.
+--------------------

That's what the Xerox NS "Packet Exchange" protocol is all about.

Even so, you should be careful to set up your services to be idempotent.
Since any given packet may be lost/duplicated, it should be o.k. for a
given request (with the same tag/cookie/handle) to be acted on more
than once. Note that UNIX reads and writes have this property, if the
file offset ["lseek" pointer] is included in the request, and you don't
do a second operation 'til the first has succeeded.  (Unfortunately,
"open" and "close" are harder, without additional state in the server
that the simple protocol was designed to avoid in the first place.)
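
A sketch of what such an idempotent request might look like (the layout and
names are invented; the point is that the offset travels in the request
itself, so the server keeps no state and a duplicated request does no harm):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* A hypothetical single-packet read request. */
    struct read_req {
        long tag;           /* client's cookie, echoed in the reply */
        long offset;        /* where in the file to read */
        long count;         /* how many bytes (caller's buf must hold them) */
        char path[64];      /* which file */
    };

    /* Server side: open/seek/read/close on every request, so no state
     * survives between requests and handling the same request twice
     * returns the same bytes -- i.e., the operation is idempotent. */
    int
    handle_read(struct read_req *rq, char *buf)
    {
        int fd, n;

        if ((fd = open(rq->path, O_RDONLY)) < 0)
            return -1;
        if (lseek(fd, (off_t)rq->offset, SEEK_SET) < 0) {
            close(fd);
            return -1;
        }
        n = read(fd, buf, (size_t)rq->count);
        close(fd);
        return n;
    }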

Rob Warnock

UUCP:	{sri-unix,amd70,hpda,harpo,ihnp4,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphins Drive, Redwood City, CA 94065

beau@sun.uucp (Beau James) (02/07/84)

With respect to the reliability of the X.25 protocol ...

The X.25 protocol set (level 3, packet; level 2, link; level 1, electrical)
was designed for a network whose general structure looks like

	(DTE) ---- (DCE) .............. (DCE) ---- (DTE)

According to the CCITT spec, X.25 specifies the *interface* to the packet
transmission network - the "----" link in the diagram.  What happens to
packets inside the transmission network on their way to the remote end
DTE is not specified.  The specification does not even tell you how
to implement a DCE, although most of the differences from a DTE are obvious.

The unreliabilities arise from the fact that there is no part of the X.25
specification that is assured to be end-to-end between the DTEs that
have the open virtual circuit.  The CCITT spec specifically leaves that
decision up to the implementors of the packet data network (PDN)
(assumed to be the government communication agencies (PTTs) everywhere
but in the U.S.).

Many control messages can cause duplicate packets end-to-end because
they may be generated locally.  (E.g. DTE RESET: only one end of
the connection may be reset; if so, and that DTE resends unacknowledged
packets, the "remote" DTE may see the same data twice.)  Even the
ordinary data packet acknowledgement scheme is not reliable, since the
DCE may acknowledge successful transmission locally, meaning across the
DTE/DCE interface.  If the data does not get to the remote DTE for any
reason, there is no mechanism for determining which packets got lost.
Not very useful, but all according to the standard.

This is not to say that the X.25 protocols cannot be USED in a reliable,
end-to-end network design, IF the implementor ensures that all the X.25
acknowledgement and control messages have end-to-end significance.  The
Data General Xodiac(TM) network uses X.25 virtual circuits as end-to-end
session connections over several different transmission media, for
example.  But when a PDN is the "transmission medium", the virtual circuits
are not necessarily reliable (it depends on the PDN).  In the final
analysis, each top-level service protocol (mail, file transfer, etc.)
has to provide its own end-to-end reliability, if it cares.
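
To illustrate that last point at the receiving end, the heart of "its own
end-to-end reliability" is usually just an end-to-end sequence number
carried in every message (a sketch; the field names are invented):

    /* Receiver-side check for an end-to-end protocol layered over a
     * link (an X.25 VC, a datagram net, ...) that may lose or
     * duplicate data, e.g. after a RESET.  `expected' persists across
     * calls; the sender numbers every message it originates. */
    struct e2e_msg {
        unsigned long seq;      /* end-to-end sequence number */
        int len;
        char data[512];
    };

    /* Returns 1 if the message should be handed to the application,
     * 0 if it is an old duplicate or there is a gap before it --
     * in either case it is the endpoints, not the network, that
     * decide whether to acknowledge, discard, or ask again. */
    int
    accept_msg(struct e2e_msg *m, unsigned long *expected)
    {
        if (m->seq == *expected) {      /* exactly the one we wanted */
            (*expected)++;
            return 1;
        }
        return 0;
    }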

						Beau James
						Sun Microsystems, Inc.

kre@mulga.SUN (Robert Elz) (02/13/84)

There's nothing remarkable about datagrams not being reliable under
unix.  No output is reliable.  You're not guaranteed to get an error
if the disk that you happen to be writing on develops a coughing fit
just at the time you do your write, you're not guaranteed to get
an error if the process at the other end of a pipe dies before reading
your data (though you will if it's already dead when you send it),
nor will you get an error if your output to a terminal is mangled
by noise on the phone, or simply by some super-user type doing a "wall"
at the relevant time.
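
(The pipe case is easy to demonstrate -- a sketch, with SIGPIPE ignored so
the error shows up as a return value from write():)

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <errno.h>

    int
    main()
    {
        int p[2];

        signal(SIGPIPE, SIG_IGN);   /* get EPIPE instead of being killed */
        pipe(p);
        close(p[0]);                /* the "reader" is already gone */

        /* writing with no reader left DOES produce an error ... */
        if (write(p[1], "x", 1) < 0 && errno == EPIPE)
            printf("write failed: no reader\n");

        /* ... but a write that succeeds just before the reader dies
         * (without reading) is simply lost, with no indication at all */
        return 0;
    }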

Why should unix datagrams do something different?
(As has been mentioned before, they would also cease to be
datagrams by the traditional definition.)

You will often get an error indication if something is wrong, but
you may not.

I do agree though that for most programs, datagrams are not an intelligent
service to use.

As to X.25, that can certainly not be considered to be reliable.
DCE's are permitted to RESET a virtual circuit as often as they
deem fit, and each RESET may cause data to be lost.  Higher level
protocols are necessary to guarantee data integrity.

Robert Elz,		decvax!mulga!kre