[net.dcom] Standards for commercial pac

jbn@wdl1.UUCP (08/28/85)

Datagram systems have some serious problems.  Here are a few of them.

    1.  In a pure datagram system, with no link-level retransmission, the
	probability of successfully forwarding a packet through N nodes
	declines exponentially with the number of nodes.  Ham users of
	digipeaters and UNIX users of async links for IP datagrams are
	painfully aware of this phenomenon.  You really do need link-level
	retransmission in any sizable datagram system, unless the medium
	has very low error rates.

    2.  Congestion is a serious problem in datagram systems.  No really good
	general solutions are known.  I've solved some problems associated with
	some of the simpler cases (IP/TCP via Ethernet to slow link gateways)
	but a general solution is still elusive.  There are tough theoretical
	problems here; there may be a way to organize an arbitrarily large
	datagram network, but it hasn't been discovered yet.  Telephony
	has been around long enough that we know how to build very large
	virtual circuit networks.

    3.  Datagram networks tend to break down when fully loaded; this is a
	consequence of (2) above.  There are ways around this, but they
	involve running the system in a derated mode, where keeping all
	links as busy as possible is not attempted.  The ARPANET technology 
	really works only because the ARPANET has substantially more link
	bandwidth than it needs for its traffic volume; this is
	a well known problem.  TELENET started out with ARPANET technology
	but has since gone to virtual circuits internally to get better link 
	utilization.  In any case, the IMP system of the ARPANET is not
	a true datagram system internally, although it exports a datagram
	interface.

    4.  Datagram systems have some serious vulnerabilities.  One bad guy can
	hog the network and clog up the links.  Datagram systems tend to
	rely on hosts being well-behaved.  With virtual circuits, the network
        has a positive throttle over host traffic generation, and can keep
	bad hosts from interfering with other traffic.  In networks with
	no central administrative authority over hosts, this is a serious
	problem in practice.  The ARPANET/MILNET gateways are already
	under serious strain because of this exact problem.  Tight standards
	and anti-bad-guy queuing algorithms in nodes can solve this
	problem; unfortunately the Internet lacks both.

    5.  Accounting is difficult in datagram systems.  What should a phone
	bill for a datagram net look like?  Histograms of traffic by
	time and destination?  Just a total amount?  The network may need
	to recognize clumps of packets for similar destinations and treat
	them as a ``call'' for billing purposes.  

This may sound odd coming from me, as a builder of datagram gateways.  But
datagram systems are useful in the military environment, where the important
thing is to keep going despite serious failures, not achieve maximum
throughput under optimal conditions.  They may be useful for other
purposes, if the problems above are addressed.  But a simple virtual circuit
network (a la Tymnet) behaves better than a simple datagram network (a la
Internet) given the same bandwidth.


				John Nagle

karn@petrus.UUCP (Phil R. Karn) (08/30/85)

> Datagram systems have some serious problems.  Here are a few of them.
> 
>     1.  In a pure datagram system, with no link-level retransmission, the
> 	probability of successfully forwarding a packet through N nodes
> 	declines exponentially with the number of nodes.  Ham users of
> 	digipeaters and UNIX users of async links for IP datagrams are
> 	painfully aware of this phenomenon.  You really do need link-level
> 	retransmission in any sizable datagram system, unless the medium
> 	has very low error rates.

Portions of a datagram network are free to use link level retransmission
whenever they consider it necessary. However, one of the beauties of
datagram networks is that they don't HAVE to use link level retransmission
where it doesn't make any sense (Ethernets, DDS lines with 1 in 10^9 error
rates, etc).

Packet radio (amateur or otherwise) is one of the few places where link level
acknowledgments really do make sense.
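
The exponential decline Nagle describes in point 1 is easy to quantify.  A
minimal sketch (the per-hop loss rate and hop count below are illustrative
assumptions, not figures from either post):

```python
def end_to_end_success(p_hop_loss: float, hops: int) -> float:
    """Probability a packet survives `hops` store-and-forward hops,
    assuming independent per-hop losses and no link-level retransmission."""
    return (1.0 - p_hop_loss) ** hops

# With a 10% per-hop loss rate, a 7-hop digipeater path delivers
# less than half of its packets end to end: 0.9**7 is about 0.48.
```

This is why a modest per-link error rate, harmless on one hop, becomes fatal
across a long digipeater chain.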


>     2.  Congestion is a serious problem in datagram systems.  No really good
> 	general solutions are known.  I've solved some problems associated with
> 	some of the simpler cases (IP/TCP via Ethernet to slow link gateways)
> 	but a general solution is still elusive.  There are tough theoretical
> 	problems here; there may be a way to organize an arbitrarily large
> 	datagram network, but it hasn't been discovered yet.  Telephony
> 	has been around long enough that we know how to build very large
> 	virtual circuit networks.

Congestion is a serious problem in any network that depends on well-behaved
user statistics, be it virtual circuit, datagram or simple circuit switched.
I could make a virtual circuit network go into "congestion collapse" just
like an IP network, assuming that I have a transport protocol in each case.
I merely set the reset-and-re-establish VC timer in my transport protocol to
be short enough that it frequently clears and re-establishes the underlying
VC, preferably ten or twenty times for each successful packet delivery.
Considering that most VC networks assume virtual circuit setups to be
rare events (some even have central nodes performing all circuit setup and
teardown operations) I think this could cause a lot of havoc.

Telephony has been around a long time, but we still don't know how to build
a large circuit switched network (virtual or otherwise) that isn't
susceptible to congestion collapse. Just see the notes on net.ham-radio
about what happened to phone service in Tucson AZ during the recent TAPR TNC
sale.  The only guaranteed way out is to have enough network resources for
the absolute worst possible case. In most long haul networks this is clearly
out of the question, so you just try to deal with it as best as you can.

>     3.  Datagram networks tend to break down when fully loaded; this is a
> 	consequence of (2) above.  There are ways around this, but they
> 	involve running the system in a derated mode, where keeping all
> 	links as busy as possible is not attempted.  The ARPANET technology 
> 	really works only because the ARPANET has substantially more link
> 	bandwidth than it needs for its traffic volume; this is
> 	a well known problem.  TELENET started out with ARPANET technology
> 	but has since gone to virtual circuits internally to get better link 
> 	utilization.  In any case, the IMP system of the ARPANET is not
> 	a true datagram system internally, although it exports a datagram
> 	interface.

I don't understand this comment. It is at least possible to re-route excess
traffic around a congested area when datagrams are used. If you have N
virtual circuits established through a given fixed route, there's not much
you can do if all N users decide to send simultaneously, overloading the
links along the route. Of course, you could statically allocate link
bandwidth and buffer space for each virtual circuit, something that is
difficult to do in a datagram network. However, this defeats the whole point
of packet switching, namely the statistical sharing of resources. If you
really want to guarantee throughput once a connection is established, build
a pure circuit-switched network; if you want the guaranteed ability to
establish a connection at any time, put in a leased line.

TELENET went to VCs internally for two reasons: a) they only had to provide
a virtual circuit service, X.25; b) the bulk of their traffic consists of
single character packets from people typing on dumb terminals. In this case
the larger datagram headers were the deciding factor.

>     4.  Datagram systems have some serious vulnerabilities.  One bad guy can
> 	hog the network and clog up the links.  Datagram systems tend to
> 	rely on hosts being well-behaved.  With virtual circuits, the network
>         has a positive throttle over host traffic generation, and can keep
> 	bad hosts from interfering with other traffic.  In networks with
> 	no central administrative authority over hosts, this is a serious
> 	problem in practice.  The ARPANET/MILNET gateways are already
> 	under serious strain because of this exact problem.  Tight standards
> 	and anti-bad-guy queuing algorithms in nodes can solve this
> 	problem; unfortunately the Internet lacks both.

Not unlike virtual circuit networks. The IP "source quench" mechanism is part
of the protocol; unfortunately many hosts refuse to adhere to it. I could
also refuse to adhere to X.25 and send traffic outside of my agreed-upon window, for
example, or I could (and do!) establish additional virtual circuits to my
destination to circumvent the much-touted per-VC network flow control
ability. The only answer in either case is to cut off hosts that don't play
by the rules, but this is an implementation problem, not a problem with the
protocols.
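
For what it's worth, here is one shape a well-behaved response to source
quench might take in a host.  The halving-and-recovery policy is my own
illustrative assumption; ICMP itself does not mandate any particular
reaction:

```python
class QuenchAwareSender:
    """Toy sender that honors source quench by halving its send rate,
    then recovering slowly -- a sketch of one policy, not a spec."""

    def __init__(self, rate_pps: float, floor_pps: float = 1.0):
        self.rate_pps = rate_pps
        self.floor_pps = floor_pps

    def on_source_quench(self) -> None:
        # Back off sharply when the network complains.
        self.rate_pps = max(self.floor_pps, self.rate_pps / 2.0)

    def on_quiet_interval(self) -> None:
        # Probe back up gently between quenches.
        self.rate_pps += 1.0
```

A host that ignores the quench entirely gets whatever "punishment" the
switches choose to impose; a host running something like the above at least
gives the network a handle on its traffic.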

>     5.  Accounting is difficult in datagram systems.  What should a phone
> 	bill for a datagram net look like?  Histograms of traffic by
> 	time and destination?  Just a total amount?  The network may need
> 	to recognize clumps of packets for similar destinations and treat
> 	them as a ``call'' for billing purposes.  
> 

PDNs already charge for both connect time and for packets sent. (I've
sometimes suggested, only half in jest, that the real reason they don't like
datagram services is because they'd no longer be able to charge for connect
time.)  Since most datagram traffic would continue to be "clustered" to
a small set of destinations, I don't see any problem with billing by
per-destination packet counts in the local switch. TELENET punts the issue
anyway, since their charges are distance-independent.
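
Recognizing "clumps" of datagrams for billing, as Nagle suggests, is not
hard to sketch.  The 30-second idle gap below is an arbitrary illustrative
choice:

```python
def bill_as_calls(packets, idle_gap: float = 30.0) -> dict:
    """Group (timestamp, destination) datagram records into billable
    'calls': consecutive packets to the same destination separated by
    less than `idle_gap` seconds fall into one call."""
    calls = {}       # destination -> list of per-call packet counts
    last_seen = {}   # destination -> timestamp of previous packet
    for ts, dest in sorted(packets):
        if dest not in calls or ts - last_seen[dest] >= idle_gap:
            calls.setdefault(dest, []).append(0)  # open a new "call"
        calls[dest][-1] += 1
        last_seen[dest] = ts
    return calls
```

The local switch already sees every packet's destination, so per-destination
counts like these are cheap to keep.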

I guess we don't really disagree on what needs to be done to make datagram
networks like the Internet behave well under load as they grow.  My
suggestions are as follows:

1. Implement mechanisms to "punish" hosts that misbehave by ignoring ICMP
source quench messages.

2. Make sure that each packet switch has more than enough buffer memory to
handle all but extremely unusual peak traffic bursts. The older IMPs and IP
gateways are probably the major offenders in this regard. I suspect that
memory-starved IP gateways account for the vast majority of dropped
datagrams (ignoring causes such as unreachable destinations, of course.)

Regardless of the protocol, the laws of queuing theory still apply. If you
use an internal flow control mechanism to avoid dropping packets in a
memory-starved packet switch, you won't be able to utilize your outgoing
link as efficiently. The larger the queue on your outgoing link, the closer
you'll be able to approach 100% utilization.
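
The queuing-theory point is concrete.  For the textbook M/M/1 case (a
simplifying assumption; real switch traffic is not Poisson), mean time in
the system is 1/(mu - lambda), which blows up as utilization approaches
100%:

```python
def mm1_mean_delay(service_rate_pps: float, arrival_rate_pps: float) -> float:
    """Mean time in system (queueing + service) for an M/M/1 queue:
    T = 1 / (mu - lambda)."""
    if arrival_rate_pps >= service_rate_pps:
        raise ValueError("queue is unstable at or above 100% utilization")
    return 1.0 / (service_rate_pps - arrival_rate_pps)

# On a link that serves 100 packets/second:
#   at 50% load the mean delay is 20 ms;
#   at 99% load it is a full second.
```

High link utilization and low delay are bought with buffer memory; you
cannot have all three at once.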

3. Use link level acknowledgements only on those paths (radio, dialup
modems) that are unreliable enough to justify them. Better yet, do something
to improve the raw error rate on the links. Get rid of link acknowledgments
on all other paths to improve link efficiency.

4. Once the above steps are taken, the dropped packet rate should fall to
a very low value. Once this happens, it should be possible to convince TCP
implementers to lengthen their retransmission timers significantly to avoid
congestion collapse when round trip delays jump because of sudden load.
If you can send a datagram with a very high degree of confidence that it'll
get there (eventually), people won't be tempted to use such trigger-happy
retransmission timers.
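
A less trigger-happy timer can be sketched with the smoothed round-trip
estimator in the style of the original TCP specification; the constants
below are conventional choices, not anything from this thread:

```python
class RetransmitTimer:
    """Smoothed RTT estimator: srtt = a*srtt + (1-a)*sample,
    timeout = b*srtt.  A larger backoff factor b makes the timer
    more tolerant of sudden delay jumps under load."""

    def __init__(self, alpha: float = 0.875, beta: float = 2.0,
                 initial_rtt: float = 1.0):
        self.alpha, self.beta = alpha, beta
        self.srtt = initial_rtt

    def sample(self, measured_rtt: float) -> float:
        """Fold in one RTT measurement; return the new timeout."""
        self.srtt = self.alpha * self.srtt + (1.0 - self.alpha) * measured_rtt
        return self.timeout()

    def timeout(self) -> float:
        return self.beta * self.srtt
```

When delays spike, the smoothed estimate follows them up instead of firing
retransmissions that only deepen the congestion.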

Phil Karn

karn@petrus.UUCP (Phil R. Karn) (09/04/85)

Here's some more information about Telenet that might be interesting.  As I
understood an explanation given by one of their employees at an amateur
packet radio conference last fall, each of their packet switches appears as
a self-contained X.25 "network", and these switches speak to each other with
X.75.  (X.75 was intended as an "Internetwork" protocol for the
interconnection of X.25 networks owned by different operators, but it, like
the DoD "Internet" Protocol, can also be used internally within individual
networks as well.)

Once an initial connection is established, the packet switch translates the
VC identifier in each arriving packet to the proper identifier for the
correct outgoing link, and sends the packet.  This operation applies to flow
control (RR/RNR) as well as data packets. What this means is that regardless
of the setting of the D-bit, flow control in Telenet is done on an
end-to-end basis. (The "D", or "Delivery Confirmation" bit is supposed to
control whether X.25 DCE packet level acknowledgements indicate that the
packet has been accepted by the network or whether it has actually been
delivered to the other DTE.)
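
The per-switch translation described above amounts to a table lookup and a
header rewrite.  A sketch (link names, VC numbers and the packet layout are
invented for illustration):

```python
# Each switch maps (incoming link, incoming VC id) to
# (outgoing link, outgoing VC id) and rewrites the header in place.
SWITCH_TABLE = {
    ("link0", 5): ("link2", 17),
    ("link1", 3): ("link2", 18),
}

def forward(in_link: str, packet: dict) -> tuple:
    """Rewrite the VC identifier and pick the outgoing link.
    The same rewrite applies to RR/RNR flow-control packets,
    which is how the flow control ends up end-to-end."""
    out_link, out_vc = SWITCH_TABLE[(in_link, packet["vc"])]
    packet["vc"] = out_vc
    return out_link, packet
```

Since RR/RNR packets are translated and passed through like data, the
window is effectively clocked by the far end of the connection.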

The result of this is that you can never have more than W (the window size,
typically 2) packets in flight at any one time on each virtual circuit.  The
carrier likes this, since it alleviates his congestion problems, but the
user hates it because it puts an upper bound on his throughput. This is also
the reason why the CSNET software is forced to open multiple, parallel
virtual circuits in order to get reasonable throughput. Of course, only
those of us who run datagram protocols above X.25 are able to pull this
trick; everybody else has to put up with lousy throughput.

Once this is done, I don't see any easy way for the network to control
potential congestion if resources are limited. Maximizing user throughput
and preventing network congestion seem to be in fundamental opposition,
and virtual circuit network protocols are no panacea.
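
The per-VC ceiling Karn describes is simple arithmetic: with at most W
packets in flight, throughput cannot exceed W * packet_size / RTT.  The
numbers below are illustrative:

```python
def window_limited_throughput(window: int, packet_bytes: int,
                              rtt_s: float) -> float:
    """Upper bound on per-circuit throughput (bytes/second) when at
    most `window` packets may be outstanding at once."""
    return window * packet_bytes / rtt_s

# With W=2 and 128-byte packets over a one-second round trip,
# a single VC tops out at 256 bytes/second -- hence CSNET's
# parallel circuits.
```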

Phil

jbn@wdl1.UUCP (09/05/85)

     Karn's comments are good.  There are ways to abuse virtual circuit
networks too.  But they tend not to happen by accident, and existing calls
are usually not interfered with when the call-setup process is bottlenecked.
You can build virtual circuit systems which provide an assured level of
service so long as you can get any service at all (i.e. can place a call);
we don't know how to do this for datagrams yet.
     I'm writing a paper on queueing in datagram networks which will have
a new solution to part of the congestion problem, one good enough to keep
things going in the presence of obnoxious hosts.  When I get it done, I 
will post it here, as well as submitting it for publication.
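
Pending Nagle's paper, one plausible shape for an "anti-bad-guy" queuing
discipline is per-source round robin: each host gets its own queue, and a
flooding host can only delay its own traffic.  This is a sketch of the
general idea, not necessarily his forthcoming algorithm:

```python
from collections import OrderedDict, deque

class FairQueue:
    """Per-source round-robin queue for an outgoing link: one FIFO per
    source host, served in rotation, so no single host can monopolize
    the link no matter how fast it sends."""

    def __init__(self):
        self.queues = OrderedDict()  # source -> deque of packets

    def enqueue(self, source, packet):
        self.queues.setdefault(source, deque()).append(packet)

    def dequeue(self):
        """Serve one packet from the source at the front of the rotation."""
        if not self.queues:
            return None
        source, q = next(iter(self.queues.items()))
        packet = q.popleft()
        del self.queues[source]
        if q:
            self.queues[source] = q  # move to the back of the rotation
        return packet
```

A host that queues three packets while a polite host queues one still only
gets every other transmission slot until the polite host is served.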

					John Nagle

ch@gipsy.UUCP (Christian Huitema) (09/10/85)

Please don't try to revive the old VC/D.gram polemic! Refer to the
literature instead.

The big difference between VC & D.grams, from a commercial point of view, is
the possibility to guarantee a certain level of "quality of service". During 
the call set-up phase, buffers can be reserved in the intermediate nodes for 
the virtual circuit; it is even possible to reserve some part of the
transmission resource (an extreme case is the simulation of "physical
circuits" by TDMA satellite). Obviously, the "per duration" charge derives
from the amount of resources that were reserved. This charge is generally
*small*. (Transpac charges FF0.02, i.e. $0.0018, per minute of connection,
for a 1200bit/s national connection). The worst "per duration" charges are
encountered on international connections.

The other difference is the ability to avoid congestion due to
"transmission" overload. A typical PSDN operates at a load of 40-50% per
link during the peak hours. It is possible to block the incoming packets; a
user that would try to ignore the window limitations would get his packets
rejected (a reset with cause "remote protocol error"). It is also
possible to choose the "best route" at call set-up time, thus avoiding the
"congested areas". The same behaviour is not recommended on a datagram
network, as it tends to propagate congestion.

Still, VC networks can be congested, just like telephone networks, by an
excess of calls. That was the cause of Transpac's "black Friday" last June.
However, flow-control procedures have already been implemented on some
telephone networks, and could be ported to PSDNs.

At INRIA we experimented with LAN-satellite connections; the first
design used a datagram based gateway for "transparent" interconnection.
However we found out that the efficiency was poor, as the gateway had to
throw away packets when the load increased. Thus, in the next design, we have
implemented X25 on top of Ethernet, which allows for an easy interconnection
to the outer world, and for a much more efficient use of the external
connection. This will not cause an undue overload for local communications,
as we can use the "class 0" transport protocol, i.e. no end-to-end
acknowledgments. X25 allows you to optimize windows and packet sizes on each
subnetwork.

The last, but not the least, advantage of X25 is that all public data
networks are interconnected, and that one can establish a direct connection
with virtually any computer in the world. 

karn@petrus.UUCP (Phil R. Karn) (09/17/85)

> Please don't try to revive the old VC/D.gram polemic! Refer to the
> literature instead.

Actually, I think the real issue here is whether you should have some form
of resource preallocation in your network. This affects the datagram/VC
choice, because VC networks allow for preallocation at circuit setup time
while datagram nets can't.
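
What preallocation buys is admission control: a circuit setup succeeds only
if the resources can actually be reserved, and is blocked otherwise.  A toy
sketch (capacities and rates are invented):

```python
class Link:
    """Toy call-admission control for one link: a virtual-circuit
    setup succeeds only if the requested bandwidth can be reserved --
    exactly the guarantee a datagram network cannot make."""

    def __init__(self, capacity_bps: int):
        self.capacity_bps = capacity_bps
        self.reserved_bps = 0

    def setup_circuit(self, requested_bps: int) -> bool:
        if self.reserved_bps + requested_bps > self.capacity_bps:
            return False  # call blocked at setup time
        self.reserved_bps += requested_bps
        return True
```

The flip side, of course, is that reserved-but-idle bandwidth is wasted,
which is the whole argument for statistical sharing.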

I've done a lot of thinking about this issue lately and am fast coming to
the conclusion that if you REALLY need guaranteed bandwidth after "circuit
setup", then you want an ordinary circuit-switched network, not packet
switching.  If your traffic has a peak-to-average ratio near 1 for long
periods of time, or if you're willing to pay extra for reserve idle
bandwidth, then circuit switching is clearly superior to any form of packet
switching, be it datagram or virtual circuit.

Packet switching is meant to statistically multiplex traffic from a
collection of users whose individual requirements are unpredictably
bursty.  As you pool more and more users, however, the law of
large numbers means that the aggregate traffic becomes more and more
predictable. Therefore as public data networks grow and their links increase
in bandwidth, I think that the need for "preallocation" will decrease
considerably.
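
The smoothing effect can be stated numerically: for n independent,
identically bursty users (independence is the key assumption), the
coefficient of variation of the aggregate shrinks as 1/sqrt(n):

```python
def aggregate_burstiness(per_user_cv: float, n_users: int) -> float:
    """Coefficient of variation (std/mean) of the sum of n independent,
    identically distributed bursty sources: cv / sqrt(n).  This is why
    aggregate traffic on a big trunk looks far smoother than any one
    user's traffic."""
    return per_user_cv / (n_users ** 0.5)

# Pooling 100 users each with cv = 2.0 yields an aggregate cv of 0.2.
```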

Preallocation of buffer space is, I think, an obsolete issue. Maybe this was
important at one time when memory was expensive, but now there's little
reason why packet switches cannot be given so much memory that they almost
never have to drop packets or invoke congestion control. Delays may get
large, but only if there isn't enough transmission bandwidth to go around.

> At INRIA we experimented with LAN-satellite connections; the first
> design used a datagram based gateway for "transparent" interconnection.
> However we found out that the efficiency was poor, as the gateway had to
> trow away packets when the load increased.

Only because you didn't have enough buffer space in your gateways.

> Thus, in the next design, we have
> implemented X25 on top of Ethernet, which allows for an easy interconnecton
> to the outer world, and for a much more efficient usage of external
> connection. This will not cause an undue overload for local communications,
> as we can use the "class 0" transport protocol, i.e. no end to end
> acknowledgments. X25 allows you to optimize windows and packet sizes on each
> subnetworks.

I'm not willing to live dangerously and trust assurances of reliability from
any network, even Ethernet. Therefore I'm not willing to give up using a
transport protocol with end-to-end acknowledgements, like TCP.  Worrying
about protocol overhead on a 10 Mb/s LAN, where the minimum packet size
is 60 bytes anyway, is silly.

The best way to improve efficiency (i.e., reduce the effects of protocol
header overhead) on ANY network is to send fewer, larger packets.  This will
have a much greater effect than trying to trim down the header sizes,
because it will also reduce packet switch CPU requirements.
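
The header-overhead arithmetic bears this out (the 40-byte header figure
below assumes combined TCP/IP headers; packet sizes are illustrative):

```python
def link_efficiency(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of transmitted bytes that are user data."""
    return payload_bytes / (payload_bytes + header_bytes)

# One-character remote-echo packets with 40 bytes of header are
# under 3% efficient; 512-byte packets with the same headers are
# about 93% efficient.  Shrinking the header helps far less than
# growing the payload.
```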

> The last, but not the least, advantage of X25 is that all public data
> networks are interconnected, and that one can establish a direct connection
> with virtually any computer in the world. 

The same is true of the public telephone network, but it doesn't mean
that it's ideal for my purposes.

Phil

jbn@wdl1.UUCP (10/01/85)

      ``Lumpiness'' is a sign of proper adaptation to overload.  The 
alternative, given the same bandwidth resources,  is falling further 
and further behind as you send more and more tiny packets.  Try two 4.2BSD
systems connected via an overloaded net for comparison.  Obviously it's
better to have the bandwidth, but lumpiness is far better than continually
losing ground.  Or would you rather have the keyboard lock when you get
too far ahead, as with the old IBM 2741?

				John Nagle