[comp.protocols.time.ntp] NTP behaving badly on bad network

Martyn.Johnson@cl.cam.ac.uk (Martyn Johnson) (06/05/91)

Various people in the UK academic community are experimenting with NTP 
over our national academic network.  This is part of an experimental 
pilot project undertaken at a small number of sites, making use of IP 
over X.25.  We have IP connectivity with each other and with the Internet.

At present, the IP network is something of a lash-up, held together with 
general purpose computers rather than specialised routers (these will 
come later).  As a consequence, the network has some properties which 
make it rather unsuitable for time synchronisation. However, the 
implementation I am using (xntpd) behaves rather worse than I might 
expect.

When the network is quiet, all is fine.  However, when the network gets 
busy, the delays become long, variable and asymmetric, and the whole 
thing goes unstable, frequently resetting the local clock (sometimes by 
over a second).

As I understand it, the data filtering algorithm keeps the 8 most recent 
samples, and effectively uses the "best".  Since the network can get busy 
for hours at a time, with delays of the order of several seconds, chances 
are that none of these 8 samples will be any good.  If you do happen to 
get a really good one, it will only last about ten minutes before being 
thrown out of the shift register. Hence the offset estimate will jump 
about all over the place.
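If I've understood the spec correctly, the filter amounts to something like the following toy model (the names are mine, not from the xntpd source): keep the last 8 (delay, offset) samples and trust the offset of the one with the minimum round-trip delay.

```python
# Toy model of the NTP clock filter as I understand it: keep the 8 most
# recent (delay, offset) samples and use the offset belonging to the
# sample with the smallest round-trip delay.  A sketch only -- the real
# xntpd code is considerably more involved.
from collections import deque

FILTER_SIZE = 8

def filtered_offset(samples):
    """samples: sequence of (delay, offset) pairs, oldest first.
    Returns the offset of the minimum-delay sample among the most
    recent FILTER_SIZE entries."""
    window = deque(samples, maxlen=FILTER_SIZE)
    delay, offset = min(window)  # tuples compare on delay first
    return offset
```

This makes the failure mode plain: one good low-delay sample dominates while it is in the window, but after 8 further polls it falls out of the shift register and the filter is back to choosing among nothing but bad samples.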

Now, I'm not expecting miracles. It seems to me impossible to extract any 
useful information from the sort of data I'm seeing.  The question is: 
why does it try?  The evidence would seem to suggest that the NTP daemon 
"believes" these bogus offsets, and is quite happy to use them to update 
the local clock.  If all the offsets were small, it wouldn't matter too 
much, because the local clock loop would damp out the changes.  But many 
of these offsets are large enough to make it replace the local clock 
value, reset everything and start again.  This means that the clock jumps 
around all over the place, when it would actually be much better to leave 
it alone, doing only the skew compensation based on data collected when 
the network was good.
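The decision I seem to be observing looks roughly like this (my guess at the logic; the threshold value here is an assumption of mine, not a figure from the spec):

```python
# My guess at the daemon's decision: small offsets are slewed out by the
# local-clock loop, large ones cause an outright step and restart.  The
# threshold is illustrative only -- I have not checked the real constant.
STEP_THRESHOLD = 0.128  # seconds; assumed for illustration

def adjust_clock(offset):
    """Return how the local clock would be corrected for a given offset."""
    if abs(offset) > STEP_THRESHOLD:
        return "step"   # replace the clock value, reset everything
    return "slew"       # let the phase-locked loop damp it out
```

With bogus offsets well over a second, every bad sample that survives the filter lands in the "step" branch, which is exactly the jumping-about I am seeing.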

I must confess to not having read and understood everything in the NTP 
specs yet. But I observe that there is a serious analysis of error bounds 
etc. Can anyone explain to me why data which is so obviously unreliable 
is being trusted?  Is it NTP itself, or the implementation at fault?

Does anyone have any general comments on this problem?

I hope it is a short-term problem.  The network should improve as links 
get faster and dedicated hardware is installed to do routeing.  Also I am 
working on interfacing xntpd to our radio clock, which will give me a 
good local time reference (though I would still like to feel that the 
network could act as a good backup).

Martyn Johnson      maj@cl.cam.ac.uk
University of Cambridge Computer Lab
Cambridge UK

Mills@udel.edu (06/19/91)

Martyn,

You can change certain parameters in the NTP daemons to
widen the aperture within which the daemon accepts samples as true time.
However, there are other spots where timing dispersion is
excessive, like in Norway. Experience there led to certain
modifications to the NTP local-clock model that should
help your case as well. Unfortunately, these mods are in the
NTP Version 3 specification, not in previous versions. NTP
v3 has been implemented in the fuzzball servers, but is not
yet available for Unix. There have been volunteers from among
this mangy bunch to implement v3 for Unix, but so far none
have barked.

Dave