ktl@wag.caltech.edu (Kian-Tat Lim) (02/15/91)
We are using timeslave to track an NTP primary. Assuming that the 'err' column in the debugging output is in units of milliseconds, we seem to be off by as much as 1/4 second at times, both fast and slow, though we're usually under 150 ms. I have two questions:

1) Why doesn't timeslave do a better job? The NTP host appears to adjust its time by steps of 10 ms, as well as adjusting its clock rate. timeslave seems to have a hard time following the 10 ms jumps. Should I be fiddling with some of the timeslave options to get better tracking? (It doesn't look like there's any control over the internal filtering algorithm, unfortunately.) I have also observed oscillatory behavior, in which timeslave will overshoot the zero error point (often by as much as it was off before) over and over.

2) Is there a bug in the timetrim computation? It seems strange that negative drifts (and -xx/yy sec/hr claims) end up with positive timetrim values.

Maybe I should just give up and install xntp, but I was trying to get decent time without the headaches of using yet another vendor-unsupported package.

Here's an edited portion of our SYSLOG with -USR1 debugging:

Feb 13 16:39:36 sgi1 timeslave:     err exp-err     adj   drift   var
Feb 13 16:39:36 sgi1 timeslave:   -19.0   -17.8  -14.82  -0.058   7.4 b
[Constant at about -19 until:]
Feb 13 16:56:16 sgi1 timeslave:    -1.0   -19.3  -14.75  -0.058  18.3 b
[Wanders between -1 and -5 for a bit. Why +1495 here?]
Feb 13 17:32:20 sgi1 timeslave: -0.206/137.39 or -0.220/1.01 sec/hr; set timetrim=+1495 or -60368?
[Jump of +9.0]
Feb 13 18:17:53 sgi1 timeslave:    +4.0    -4.7  -14.65  -0.058  10.5 b
[Wanders around +4 for almost an hour, then steps up again]
Feb 13 18:54:14 sgi1 timeslave:   +13.0   +12.5  -13.48  -0.058   9.9 b
[And again, constant at about +13 till this step:]
Feb 13 19:42:26 sgi1 timeslave:   +23.0   +12.5  -12.62  -0.057  10.5 b
[Two more steps to +33 and +48 a few hours later, then a slow ramp to:]
Feb 14 04:18:54 sgi1 timeslave:   +90.0   +84.9   -4.16  -0.039   6.9 b
Feb 14 04:22:53 sgi1 timeslave:   +91.0   +84.7   -3.86  -0.039   7.0 b
Feb 14 04:26:47 sgi1 timeslave:   +92.0   +85.7   -4.26  -0.039   6.9 b
[Another step:]
Feb 14 04:30:54 sgi1 timeslave:  +102.0   +86.6   -3.69  -0.038  15.4 b
Feb 14 04:34:54 sgi1 timeslave:  +103.0   +96.4   -3.77  -0.038  14.7 b
[And another slow ramp up to:]
Feb 14 07:42:35 sgi1 timeslave:  +131.0  +123.2   +1.39  -0.025   9.8 b
Feb 14 07:46:44 sgi1 timeslave:  +132.0  +123.2   +1.94  -0.025   9.8 b
[Step:]
Feb 14 07:50:39 sgi1 timeslave:  +142.0  +124.1   +2.04  -0.025  17.9 b
Feb 14 07:50:39 sgi1 timeslave: +0.044/151.70 or +0.052/1.02 sec/hr; set timetrim=-1650 or +2428?
[More ramp:]
Feb 14 09:11:52 sgi1 timeslave:  +154.0  +145.0   +4.54  -0.018  14.5 b

--
Kian-Tat Lim (ktl@wag.caltech.edu, KTL @ CITCHEM.BITNET, GEnie: K.LIM1)
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (02/15/91)
In article <1991Feb14.174728.16734@nntp-server.caltech.edu>, ktl@wag.caltech.edu (Kian-Tat Lim) writes:
> We are using timeslave to track an NTP primary. Assuming that the
> 'err' column in the debugging output is in units of milliseconds, we
> seem to be off by as much as 1/4 second at times, both fast and slow,
> though we're usually under 150 ms. I have two questions:

It's all milliseconds. 250 msec seems rather high. Is the system suffering kernel printf's?

> 1) Why doesn't timeslave do a better job? The NTP host appears to
> adjust its time by steps of 10 ms, as well as adjusting its clock
> rate. timeslave seems to have a hard time following the 10 ms jumps.
> Should I be fiddling with some of the timeslave options to get better
> tracking? (It doesn't look like there's any control over the internal
> filtering algorithm, unfortunately.) I have also observed oscillatory
> behavior, in which timeslave will overshoot the zero error point
> (often by as much as it was off before) over and over.

Timeslave assumes its target is perfect. It was written originally to sync the SGI network to another company's with a satellite receiver, across a 9600 b/s Cypress link. The link was heavily loaded, especially by netnews. While loaded, network delays could be asymmetric by as much as 4 seconds.

If the measurements look good enough for long enough, a jump will look like network problems and be discarded. If the difference persists, timeslave will start opening the filter.

"Filter" is doubtless too fancy a word for how timeslave tries to partition the measured error among long term drift, short term jumps, and errors in the measurements. At each adjustment, it changes the clock by one 32nd of the accumulated differences between the expected and measured errors of the last 32 measurements (using 32 buckets to get the exact sum of the last 32 measurements) plus the current estimate of the drift. The drift is estimated by summing the last 8 hours of adjustments to the local clock.
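
[Editorial sketch, not part of the original post.] The adjustment rule described above can be modelled in a few lines. This is only an illustration of the description, not timeslave's source. The class and method names are hypothetical, the sign convention (errors expressed as the amount the local clock should move) is an assumption, and so is the way the 8-hour sum of adjustments is turned into a per-interval drift term (divide by the window, multiply by the measurement interval). The step that discards jumps as suspected network trouble is left out.

```python
from collections import deque


class AveragingClockFilter:
    """Illustrative model of the rule described above (not timeslave's code)."""

    def __init__(self, buckets=32, drift_window_s=8 * 3600):
        self.buckets = buckets
        self.drift_window_s = drift_window_s
        # One bucket per measurement; the deque drops the oldest automatically.
        self.residuals_ms = deque(maxlen=buckets)
        # (time_s, adjustment_ms) pairs used for the drift estimate.
        self.adjustments = deque()

    def drift_rate_ms_per_s(self, now_s):
        # Forget adjustments older than the drift window, then sum the rest.
        while self.adjustments and now_s - self.adjustments[0][0] > self.drift_window_s:
            self.adjustments.popleft()
        total_ms = sum(adj for _, adj in self.adjustments)
        # Assumption: the summed adjustments become a rate by dividing
        # by the window length.
        return total_ms / self.drift_window_s

    def step(self, now_s, measured_err_ms, expected_err_ms, interval_s):
        # Difference between what was measured and what was predicted.
        self.residuals_ms.append(measured_err_ms - expected_err_ms)
        # Exact sum over the buckets, then "one 32nd" of it.
        filtered_ms = sum(self.residuals_ms) / self.buckets
        drift_ms = self.drift_rate_ms_per_s(now_s) * interval_s
        adjustment_ms = filtered_ms + drift_ms
        self.adjustments.append((now_s, adjustment_ms))
        return adjustment_ms


f = AveragingClockFilter()
first = f.step(now_s=0, measured_err_ms=32.0, expected_err_ms=0.0, interval_s=240)
second = f.step(now_s=240, measured_err_ms=32.0, expected_err_ms=0.0, interval_s=240)
print(first, second)
```

Under this model a single unexpected 32 ms error moves the clock by only 1 ms at that adjustment, which would be consistent with a 10 ms step in the server being followed slowly.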
> 2) Is there a bug in the timetrim computation? It seems strange that
> negative drifts (and -xx/yy sec/hr claims) end up with positive
> timetrim values.

The timetrim values are simply the total of all adjustments divided by total elapsed time, and the same for the last 24 hours. The difference between the short and long term values suggests that strange things happened in the past. Kernel printf's commonly mess up time on IRIS's because they disable all interrupts. Disk errors, disk full, and late collision messages are common culprits. I've heard reports of difficulties with *ntp on trashed networks from exactly this sort of problem.

There were bugs in the timetrim computations, fixed I think in 3.3.2. As I recall, it used int's and suffered over or underflow.

> Maybe I should just give up and install xntp, but I was trying to get
> decent time without the headaches of using yet another
> vendor-unsupported package.

If you want really good time, that might be a good idea. Increasing the measurement rate to '-r 15' would probably make timeslave much more accurate at modest cost.

I've seen nearby *ntp machines claiming to be 3 msec from UTC, but according to my measurements using ICMP timestamps they were close to a second away. I've inferred that the accuracy reported by the *ntp daemon is that which would be achieved if only the daemon could adjust the computer's clock as it wished. In other words, ntp appears to report as its accuracy the difference between the measured error and the predicted error. Is this a correct inference?

Vernon Schryver, vjs@sgi.com
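
[Editorial sketch, not part of the original post.] The timetrim description in the post above ("the total of all adjustments divided by total elapsed time") amounts to an average rate. The snippet below works that arithmetic through under two assumptions: that the "x/y sec/hr" pairs in the log mean x seconds of adjustment over y hours, and that timetrim is expressed in nanoseconds per second. The function name is hypothetical. It is not expected to reproduce the values timeslave logged, which depend on timeslave's internals and, per the post, on integer arithmetic that was later fixed.

```python
def average_rate(total_adjust_s, elapsed_hr):
    """Average clock rate error from total adjustment and elapsed time."""
    sec_per_hr = total_adjust_s / elapsed_hr
    # 1 hour = 3600 s; 1 s = 1e9 ns (ns/s is an assumed unit for timetrim).
    ns_per_s = sec_per_hr * 1e9 / 3600.0
    return sec_per_hr, ns_per_s


# Figures taken from the first timetrim line in the log, read as seconds/hours.
long_term = average_rate(-0.206, 137.39)
short_term = average_rate(-0.220, 1.01)
print(long_term, short_term)
```

With this reading, the sign of the rate always follows the sign of the total adjustment, so a sign flip in the suggested value would have to come from somewhere else in the computation.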
aspgpas@cidsv01.cid.aes.doe.CA (Peter Silva) (02/19/91)
|> > We are using timeslave to track an NTP primary. Assuming that the
|> > 'err' column in the debugging output is in units of milliseconds, we
|> > seem to be off by as much as 1/4 second at times, both fast and slow,
|> > though we're usually under 150 ms. I have two questions:
|>
|> It's all milliseconds. 250 msec seems rather high. Is the system
|> suffering kernel printf's?
|>

We run a couple of dozen Irises and have the same problem. It seems especially slow to sync up after systems have been down for a while (after a morning reboot, and no, it wasn't down for any length of time before or after, the station is still 1500 ms off at 4:30pm).

|> > Maybe I should just give up and install xntp, but I was trying to get
|> > decent time without the headaches of using yet another
|> > vendor-unsupported package.
|>
|> If you want really good time, that might be a good idea. Increasing the
|> measurement rate to '-r 15' would probably make timeslave much more
|> accurate at modest cost.
|>

I tried (not too hard) building xntp, but it absolutely wanted the "tickadj" kernel variables to exist. Is there a version hacked for the IRIS?

I've been using ntp on two stations for two months, on the same LAN, configured to go to the same servers (as a test). One of the stations is unloaded; the other has me pounding away on it all day. ntp runs, and complains about the lack of tickadj, but seems to have a "plan B" to deal with it.

The results vary wildly from week to week. Sometimes both stations are within a millisecond or two of whatever stratum 1 server ntp has picked; sometimes they are off by as much as a hundred milliseconds. And, of course, when it gets that far off, it starts polling every 64 seconds... AARRRGGHH!

My checks are just using ntpdc. My station is the one that usually goes nuts, but at the moment the other one is nuts too because it's been down a while and is still syncing up. My station usually has wildly high Dispersion numbers too, of course (but today they seem almost reasonable: nothing over 300).

|> I've seen nearby *ntp machines claiming to be 3 msec from UTC, but

I don't think I'm getting more than double or triple the precision of timeslave at the moment. Why is still a mystery. I won't trust it until I know why this stuff happens.

Peter Silva                 OS Support
psilva@cid.aes.doe.ca       Dorval Computing Centre
(514) 421-4692              Atmospheric Environment Service