[comp.sys.sgi] Timeslave question

ktl@wag.caltech.edu (Kian-Tat Lim) (02/15/91)

We are using timeslave to track an NTP primary.  Assuming that the
'err' column in the debugging output is in units of milliseconds, we
seem to be off by as much as 1/4 second at times, both fast and slow,
though we're usually under 150 ms.  I have two questions:

1) Why doesn't timeslave do a better job?  The NTP host appears to
adjust its time by steps of 10 ms, as well as adjusting its clock
rate.  timeslave seems to have a hard time following the 10 ms jumps.
Should I be fiddling with some of the timeslave options to get better
tracking?  (It doesn't look like there's any control over the internal
filtering algorithm, unfortunately.)  I have also observed oscillatory
behavior, in which timeslave will overshoot the zero error point
(often by as much as it was off before) over and over.

2) Is there a bug in the timetrim computation?  It seems strange that
negative drifts (and -xx/yy sec/hr claims) end up with positive
timetrim values.

Maybe I should just give up and install xntp, but I was trying to get
decent time without the headaches of using yet another
vendor-unsupported package.

Here's an edited portion of our SYSLOG with -USR1 debugging:

Feb 13 16:39:36 sgi1 timeslave:     err exp-err     adj  drift    var
Feb 13 16:39:36 sgi1 timeslave:   -19.0   -17.8  -14.82 -0.058    7.4 b

[Constant at about -19 until:]

Feb 13 16:56:16 sgi1 timeslave:    -1.0   -19.3  -14.75 -0.058   18.3 b

[Wanders between -1 and -5 for a bit.  Why +1495 here?]

Feb 13 17:32:20 sgi1 timeslave: -0.206/137.39 or -0.220/1.01 sec/hr;
			 set timetrim=+1495 or -60368?

[Jump of +9.0]

Feb 13 18:17:53 sgi1 timeslave:    +4.0    -4.7  -14.65 -0.058   10.5 b

[Wanders around +4 for almost an hour, then steps up again]

Feb 13 18:54:14 sgi1 timeslave:   +13.0   +12.5  -13.48 -0.058    9.9 b

[And again, constant at about +13 till this step:]

Feb 13 19:42:26 sgi1 timeslave:   +23.0   +12.5  -12.62 -0.057   10.5 b

[Two more steps to +33 and +48 a few hours later, then a slow ramp to:]

Feb 14 04:18:54 sgi1 timeslave:   +90.0   +84.9   -4.16 -0.039    6.9 b
Feb 14 04:22:53 sgi1 timeslave:   +91.0   +84.7   -3.86 -0.039    7.0 b
Feb 14 04:26:47 sgi1 timeslave:   +92.0   +85.7   -4.26 -0.039    6.9 b

[Another step:]

Feb 14 04:30:54 sgi1 timeslave:  +102.0   +86.6   -3.69 -0.038   15.4 b
Feb 14 04:34:54 sgi1 timeslave:  +103.0   +96.4   -3.77 -0.038   14.7 b

[And another slow ramp up to:]

Feb 14 07:42:35 sgi1 timeslave:  +131.0  +123.2   +1.39 -0.025    9.8 b
Feb 14 07:46:44 sgi1 timeslave:  +132.0  +123.2   +1.94 -0.025    9.8 b

[Step:]

Feb 14 07:50:39 sgi1 timeslave:  +142.0  +124.1   +2.04 -0.025   17.9 b
Feb 14 07:50:39 sgi1 timeslave: +0.044/151.70 or +0.052/1.02 sec/hr;
			 set timetrim=-1650 or +2428?

[More ramp:]

Feb 14 09:11:52 sgi1 timeslave:  +154.0  +145.0   +4.54 -0.018   14.5 b
-- 
Kian-Tat Lim (ktl@wag.caltech.edu, KTL @ CITCHEM.BITNET, GEnie: K.LIM1)

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (02/15/91)

In article <1991Feb14.174728.16734@nntp-server.caltech.edu>, ktl@wag.caltech.edu (Kian-Tat Lim) writes:
> We are using timeslave to track an NTP primary.  Assuming that the
> 'err' column in the debugging output is in units of milliseconds, we
> seem to be off by as much as 1/4 second at times, both fast and slow,
> though we're usually under 150 ms.  I have two questions:

It's all milliseconds.  250 msec seems rather high.  Is the system
suffering kernel printf's?

> 1) Why doesn't timeslave do a better job?  The NTP host appears to
> adjust its time by steps of 10 ms, as well as adjusting its clock
> rate.  timeslave seems to have a hard time following the 10 ms jumps.
> Should I be fiddling with some of the timeslave options to get better
> tracking?  (It doesn't look like there's any control over the internal
> filtering algorithm, unfortunately.)  I have also observed oscillatory
> behavior, in which timeslave will overshoot the zero error point
> (often by as much as it was off before) over and over.

Timeslave assumes its target is perfect.  It was written originally to sync
the SGI network to another company's with a satellite receiver, across a
9600 b/s Cypress link.  The link was heavily loaded, esp.  by netnews.
While loaded, network delays could be asymmetric by as much as 4 seconds.
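
To see why asymmetric delays matter, here is a small Python sketch of a generic round-trip offset estimate. It assumes the client notes its own send and receive times and gets one timestamp from the remote host, then guesses that the remote stamp was taken halfway through the round trip. Whether timeslave measures exactly this way is an assumption, and the function name is hypothetical.

```python
def estimated_offset(t_send, t_remote, t_recv):
    """Remote-minus-local clock offset, assuming equal delay each way.

    t_send, t_recv: local clock at request send and reply receive.
    t_remote: remote clock when the request arrived.
    """
    return t_remote - (t_send + t_recv) / 2.0


# Both clocks actually agree (true offset 0), but the outbound leg
# takes 4.5 s and the return leg 0.5 s: a 4 s asymmetry.
out_delay, back_delay = 4.5, 0.5
t_send = 0.0
t_remote = t_send + out_delay
t_recv = t_remote + back_delay

apparent = estimated_offset(t_send, t_remote, t_recv)
print(apparent)  # half of the asymmetry shows up as a false offset
```

Under these assumptions, half of the delay asymmetry appears as clock error, so a 4 second asymmetry looks like a 2 second offset even when the clocks agree.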

If the measurements look good enough for long enough, a jump will look like
network problems and be discarded.  If the difference persists, timeslave
will start opening the filter.

"Filter" is doubtless too fancy a word for how timeslave tries to partition
the measured error among long term drift, short term jumps, and errors in
the measurements.  At each adjustment, it changes the clock by one 32nd of
accumulated differences between the expected and measured errors of the
last 32 measurements (using 32 buckets to get the exact sum of the last 32
measurements) plus the current estimate of the drift.  The drift is
estimated by summing the last 8 hours of adjustments to the local clock.
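
A minimal Python sketch of the rule described above, for illustration only; it is not timeslave's source. The class and function names, the sign convention (measured minus expected error), and the choice to pass the drift term in from the caller are all assumptions.

```python
from collections import deque


class BucketFilter:
    """Hypothetical model: adjust by 1/32 of the last 32 residuals plus drift."""

    def __init__(self, buckets=32):
        self.buckets = buckets
        # A bounded deque keeps exactly the last `buckets` residuals.
        self.residuals = deque(maxlen=buckets)

    def adjustment(self, measured_err_ms, expected_err_ms, drift_ms=0.0):
        # Assumed sign convention: residual = measured - expected.
        self.residuals.append(measured_err_ms - expected_err_ms)
        return sum(self.residuals) / self.buckets + drift_ms


def adjustments_in_window(history, now_s, window_s=8 * 3600):
    """Sum the adjustments made within the last `window_s` seconds.

    history: iterable of (time_s, adjustment_ms) pairs.
    How this total is scaled into a per-adjustment drift term is not
    stated in the post, so it is left to the caller.
    """
    return sum(adj for t, adj in history if now_s - t <= window_s)


f = BucketFilter()
first = f.adjustment(10.0, 0.0)   # one 10 ms surprise
for _ in range(31):
    last = f.adjustment(10.0, 0.0)
print(first, last)
```

In this model a persistent 10 ms residual produces only 10/32 ms of correction on the first measurement, and the full 10 ms only after 32 measurements.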

> 2) Is there a bug in the timetrim computation?  It seems strange that
> negative drifts (and -xx/yy sec/hr claims) end up with positive
> timetrim values.

The timetrim values are simply the total of all adjustments divided by
total elapsed time, and the same for the last 24 hours.  The difference in
the short and long term values suggests that strange things happened in the
past.  Kernel printf's commonly mess up time on IRIS's because they disable
all interrupts.  Disk errors, disk full, and late collision messages are
common culprits.  I've heard reports of difficulties with *ntp on trashed
networks from exactly this sort of problem.

There were bugs in the timetrim computations, fixed I think in 3.3.2.
As I recall, it used int's and suffered over or underflow.
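
A rough Python sketch of the two ideas above: a rate computed as total adjustment over elapsed time, and how a running total kept in a 32-bit int can wrap and change sign. The function names and the units (milliseconds for the total, nanoseconds for the accumulator) are assumptions chosen for illustration; the post does not say which quantity actually overflowed.

```python
def rate_sec_per_hr(total_adjust_ms, elapsed_hr):
    """Average clock correction rate: total adjustment / elapsed time."""
    return (total_adjust_ms / 1000.0) / elapsed_hr


def wrap_int32(value):
    """Value a signed 32-bit accumulator would hold after wrapping."""
    return (value + 2**31) % 2**32 - 2**31


# 500 ms of total correction over 2 hours.
print(rate_sec_per_hr(500.0, 2.0))

# Hypothetical accumulator in nanoseconds: 2.2 s of positive
# adjustments no longer fits in 32 bits and comes out negative.
print(wrap_int32(2_200_000_000))
```

The second result shows how a positive total can turn into a negative stored value once it exceeds 2**31 - 1, which is the general kind of failure that int overflow produces.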

> Maybe I should just give up and install xntp, but I was trying to get
> decent time without the headaches of using yet another
> vendor-unsupported package.

If you want really good time, that might be a good idea.  Increasing the
measurement rate to '-r 15' would probably make timeslave much more
accurate at modest cost.

I've seen nearby *ntp machines claiming to be 3 msec from UTC but,
according to my measurements using ICMP timestamps, close to a second away.
I've inferred that the accuracy reported by the *ntp daemon is that
which would be achieved if only the daemon could adjust the computer's
clock as it wished.  In other words, ntp appears to report as its accuracy
the difference between the measured error and the predicted error.  Is this
a correct inference?

Vernon Schryver,   vjs@sgi.com

aspgpas@cidsv01.cid.aes.doe.CA (Peter Silva) (02/19/91)

|> > We are using timeslave to track an NTP primary.  Assuming that the
|> > 'err' column in the debugging output is in units of milliseconds, we
|> > seem to be off by as much as 1/4 second at times, both fast and slow,
|> > though we're usually under 150 ms.  I have two questions:
|> 
|> It's all milliseconds.  250 msec seems rather high.  Is the system
|> suffering kernel printf's?
|>

We run a couple of dozen Irises and have the same problem.  It seems especially
slow to sync up after systems have been down for a while (after a morning
reboot, (no, it wasn't down any amount of time before or after) the station is
still 1500 ms off at 4:30pm). 

|> > Maybe I should just give up and install xntp, but I was trying to get
|> > decent time without the headaches of using yet another
|> > vendor-unsupported package.
|> 
|> If you want really good time, that might be a good idea.  Increasing the
|> measurement rate to '-r 15' would probably make timeslave much more
|> accurate at modest cost.
|>

I tried  (not too hard) building xntp, but it absolutely wanted the 
"tickadj" kernel variables to exist.  Is there a version hacked for the IRIS? 
I've been using ntp on two stations for two months, on the same lan, 
configured to go to the same servers, (as a test).  One of the 
stations is un-loaded, the other has me pounding away on it all day.  
ntp runs, and complains about the lack of tickadj, but seems to have
a "plan B" to deal with it.

The results vary wildly from week to week.  Sometimes both stations are within
a millisecond or two of whatever stratum 1 server it's picked, sometimes it's
off by as much as a hundred milliseconds.  And, of course, when it gets that
far off, it starts polling every 64 seconds... AARRRGGHH!
My checks are just using ntpdc.

My station is the one that usually goes nuts, but at the moment the other
one is nuts too because it's been down awhile, and is still syncing up.  My 
station usually has wildly high Dispersion numbers too, of course (but today 
they seem almost reasonable, nothing over 300).

|> I've seen nearby *ntp machines claiming to be 3 msec from UTC, but

I don't think I'm getting more than double or triple the precision of 
timeslave at the moment.   Why is still a mystery.

I won't trust it until I know why this stuff happens.


Peter Silva			OS Support 
psilva@cid.aes.doe.ca		Dorval Computing Centre
(514) 421-4692			Atmospheric Environment Service