[comp.protocols.time.ntp] A roaring in the swamps

Mills@udel.edu (01/20/91)
Folks,

I have recently noticed some degradation in the timekeeping quality
shown by many of the fuzzball primary NTP servers, particularly
at umd1.umd.edu, but to a lesser extent at all stratum-1 servers.
The degradation is not debilitating, from an expected accuracy
of maybe 20 ms to something like twice that, but this is still
a concern for precision measurements we would like to make with
DARTnet. 

The problem is that some of these servers are being banged upon
by an incredible wash of alligators and the poor fuzz creatures
are building up significant traffic in their output queues. Some
idea of the situation can be gleaned from the following update
of a survey I last did some months ago. The table shows the
mean flux of NTP messages received per second for each of the
public NTP fuzzball servers, both at stratum-1 and stratum-2:

Stratum-1

umd1.umd.edu		2.59
truechimer.cso.uiuc.edu	2.07
ncarfuzz.ucar.edu	1.95
fuzz.sdsc.edu		1.72
wwvb.isi.edu		1.44
dcn1.udel.edu		1.34
dcn5.udel.edu		1.21

Stratum-2

lilben.tn.cornell.edu	0.77
clock.sura.net		0.55
libra.rice.edu		0.31
fuzz.psc.edu		0.38

While a flux of 2.59 packets per second might not sound like much,
this means there can be significant busy periods where the packets
all gang up at about the same time and clog the output queue,
leading to artificially long transit times and degraded accuracy.
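
To get a rough feel for how often packets gang up at that rate, here
is a minimal sketch that assumes the arrivals form a Poisson process.
That is an assumption made for illustration, not something measured;
real NTP clients poll on similar schedules, so the true traffic may
well be burstier. The function name and the choice of a one-second
window are likewise hypothetical.

```python
import math

def prob_at_least(k, rate, window):
    """P(at least k arrivals in `window` seconds), assuming Poisson arrivals."""
    mean = rate * window
    # Sum the Poisson probabilities of 0 .. k-1 arrivals, then take the complement.
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below

RATE = 2.59  # packets per second, the umd1.umd.edu figure from the table

if __name__ == "__main__":
    for k in (5, 8, 10):
        p = prob_at_least(k, RATE, 1.0)
        print(f"P(>= {k} packets in 1 s) = {p:.4f}")
```

Even a small per-second probability of a large burst adds up over a
day of operation, which is the sense in which a modest mean flux can
still clog the queue.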

Obviously, accuracies can be improved with better load management,
specifically offloading the primary servers to the secondary ones,
which continue to be underutilized, as well as balancing the
loads on the primary servers. From occasional observations of
the various servers I continue to see many instances where more
than one campus server chimes with a single primary server, and
sometimes several do. While a case can be made for maybe two
campus servers to chime with the same primary server, in almost
all cases the accuracy and robustness of campus time is enhanced
to the max when the urge to pile all the campus chimers on the
same set of servers is successfully resisted. It is much better
to scatter the peers of up to three (not more) campus secondary
servers all on different primary servers.

A useful rule of thumb when designing NTP configurations is for
each campus server to peer with two primary servers and with
the other campus server(s) and with one secondary server from
a nearby campus or one of the NSFNET secondary servers. In fact,
the NTP subnet is so richly connected, especially across the
NSFNET backbone, that the NSFNET secondary servers are just
about as solid in accuracy and robustness as the primary
servers. Accordingly, chimers might do just as well to chime the
secondary servers only. If that is done, chime only the secondary
servers and not the primary ones; otherwise, the selection
algorithm can be yanked by a single falseticking primary.
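
The rule of thumb above can be sketched as a small planning routine.
This is only an illustration under stated assumptions: the function
name plan_peers, the dictionary layout and the campus host names are
hypothetical, and handing out the primaries in list order is just one
simple way to keep the campus servers on different primaries.

```python
def plan_peers(campus, primaries, secondaries):
    """Give each campus server two primaries of its own, the other campus
    servers, and one outside secondary server."""
    if len(campus) > 3:
        raise ValueError("use at most three campus secondary servers")
    if len(primaries) < 2 * len(campus):
        raise ValueError("not enough primaries to avoid sharing")
    if not secondaries:
        raise ValueError("need at least one outside secondary server")
    plan = {}
    for i, host in enumerate(campus):
        plan[host] = {
            # Disjoint slices, so no primary is shared between campus servers.
            "primaries": primaries[2 * i:2 * i + 2],
            "campus": [h for h in campus if h != host],
            "secondary": secondaries[i % len(secondaries)],
        }
    return plan

if __name__ == "__main__":
    # Primary and secondary names are taken from the table; the campus
    # host names are made up for the example.
    primaries = ["umd1.umd.edu", "truechimer.cso.uiuc.edu", "ncarfuzz.ucar.edu",
                 "fuzz.sdsc.edu", "wwvb.isi.edu", "dcn1.udel.edu"]
    secondaries = ["clock.sura.net", "fuzz.psc.edu"]
    campus = ["tick.example.edu", "tock.example.edu", "chime.example.edu"]
    for host, peers in plan_peers(campus, primaries, secondaries).items():
        print(host, peers)
```

A real plan would also weigh network distance and the current load on
each primary, which this sketch ignores.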

In the absence of an available Unix version-3 NTP daemon, I am
considering ways to relieve congestion at some of the primary
servers. Note that version-3 has been carefully crafted so that
accuracy can be maintained even when the poll intervals for
synchronized paths are as long as 17 minutes, so, obviously,
there is a considerable benefit to be gained by switching to
that version (hint for you software weekend warriors). One
way to relieve the congestion may be to limit access to no more than
two chimers from the same net. Another may be limiting availability of the
UDP/TIME service to only the stratum-2 servers (abuse of the
primary servers with UDP/TIME continues unabated).
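
As a rough illustration of the first idea, limiting access to two
chimers per net, here is a minimal sketch. It is not how the fuzzball
does or would do it. The class name NetLimiter is hypothetical, and
treating "the same net" as a fixed 16-bit prefix is an assumption made
only to keep the example short.

```python
import ipaddress
from collections import defaultdict

MAX_PER_NET = 2   # "no more than two chimers from the same net"
PREFIX_LEN = 16   # assumption: a net is identified by its first 16 bits

class NetLimiter:
    def __init__(self, max_per_net=MAX_PER_NET, prefix_len=PREFIX_LEN):
        self.max_per_net = max_per_net
        self.prefix_len = prefix_len
        self.clients = defaultdict(set)   # net -> client addresses admitted so far

    def allow(self, addr):
        """Return True if a request from `addr` should be served."""
        net = ipaddress.ip_network(f"{addr}/{self.prefix_len}", strict=False)
        seen = self.clients[net]
        if addr in seen:
            return True                   # already one of the admitted chimers
        if len(seen) < self.max_per_net:
            seen.add(addr)                # admit a new chimer from this net
            return True
        return False                      # net already has its quota

if __name__ == "__main__":
    limiter = NetLimiter()
    for addr in ("128.4.0.1", "128.4.0.2", "128.4.1.3", "128.8.0.1"):
        print(addr, limiter.allow(addr))
```

This version admits the first two addresses it happens to see from
each net and never forgets them; anything practical would need to
expire entries over time.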

I have done one thing in order to improve accuracy for those
customers that need it (DARTnet), while resulting in only very
minor degradation for other users, by making use of the
precedence queueing features of the fuzzball. Taking into
account all the sanity checks, stratum assignments and crypto-
checksums as configured, all those customers that potentially
can synchronize the server itself are now inserted at the head
of the output queue, rather than at the usual end. Therefore,
if you run xntpd, enable cryptographic authentication, operate
at stratum-1 or -2 and show sufficiently low delay and dispersion,
then you will go to the head of the queue.
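
The head-of-queue idea can be sketched with a double-ended queue. This
is only an illustration, not the fuzzball code: the Reply fields, the
function names and both thresholds are hypothetical, since the post
says only that delay and dispersion must be "sufficiently low".

```python
from collections import deque
from dataclasses import dataclass

MAX_DELAY_MS = 100.0        # hypothetical threshold
MAX_DISPERSION_MS = 50.0    # hypothetical threshold

@dataclass
class Reply:
    dest: str
    authenticated: bool     # passed the crypto-checksum
    stratum: int
    delay_ms: float
    dispersion_ms: float

def is_preferred(reply):
    """True if the peer meets all the conditions listed in the paragraph above."""
    return (reply.authenticated
            and reply.stratum in (1, 2)
            and reply.delay_ms <= MAX_DELAY_MS
            and reply.dispersion_ms <= MAX_DISPERSION_MS)

def enqueue(queue, reply):
    """Preferred replies go to the head of the queue, the rest to the end."""
    if is_preferred(reply):
        queue.appendleft(reply)
    else:
        queue.append(reply)

if __name__ == "__main__":
    q = deque()
    enqueue(q, Reply("a.example.edu", False, 3, 40.0, 10.0))
    enqueue(q, Reply("b.example.edu", False, 2, 40.0, 10.0))
    enqueue(q, Reply("dartnet.example.net", True, 1, 5.0, 1.0))
    print([r.dest for r in q])
```

Inserting at the head means two preferred replies queued back to back
leave in reverse order; that follows the literal description in the
paragraph and may not match what the fuzzball actually does.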

Initial tests of the new "features" indicate that the DARTnet
customers can enjoy sub-millisecond accuracies, while the
rest of you may lose a couple of milliseconds. If enough of
you can adjust your peers to equalize the loads, you will get
those precious milliseconds back.

Dave
DS