[comp.protocols.time.ntp] NTP and the infamous SPARC clock problem

wales@CS.UCLA.EDU (Rich Wales) (11/30/90)

About a month ago, I reported that we (UCLA CS Department) had been
having a very nasty time trying to keep our SPARCstations (SPARC-1's and
SPARC-SLC's, running SunOS 4.0.3) time-synched via NTP.

I've done the following things already:

(1) Everyone is running "in.ntpd", May '89 version, patch level 13.

(2) "_dosynctodr" has been set to zero in all kernels.

(3) "_tick" has been changed from 10,000 to 9,998 in all kernels (one
    person on the NTP list suggested this, and it seems to have helped).
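
As a rough check on what change (3) does, the sketch below works out the
rate change implied by the new tick value. It assumes that the kernel adds
"tick" microseconds to the system time on every clock interrupt and that
the interrupt rate itself is unchanged; the function name is made up for
illustration.

```python
# Nominal tick length in microseconds, as quoted in item (3).
NOMINAL_TICK_US = 10_000


def tick_rate_change_ppm(new_tick_us, nominal_tick_us=NOMINAL_TICK_US):
    """Change in clock rate, in parts per million, caused by a new tick value.

    Assumption: the clock advances by `tick` microseconds per interrupt and
    the interrupt frequency does not change.
    """
    return (new_tick_us - nominal_tick_us) / nominal_tick_us * 1e6


change = tick_rate_change_ppm(9_998)
# A negative value means the clock now runs slower than before.
print(f"rate change: {change:+.0f} ppm")
print(f"per day:     {change * 1e-6 * 86_400:+.2f} s")
```

Under that assumption, going from 10,000 to 9,998 slows the clock by about
200 ppm, which only helps on a machine whose clock runs fast to begin with.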

But the problem still recurs.  Especially on the SLC's -- but sometimes
on the SPARC-1's too -- a clock may lose over an hour before someone
finally sees the problem, reports it to our "help" mailing list (system
support staff), and the clock gets reset by hand.  Our users usually
notice the problem because the NTP daemon starts complaining incessantly
about how the clock is too far off for NTP to deal with it.

Our five Sun-4/380's do not suffer from this problem at all, by the way.

I'm aware of the existence of a clock problem in the SPARCs, but I had
been hoping that NTP might be able to keep it under control.  Even if
we could keep our SPARCs to within a second or so of real time, I'd be
willing to lower my standards :-} and accept such a situation.

I'm also concerned that the clock problems in our SPARCs are creating a
general impression around our department that NTP is flaky and unproven
(and that perhaps we should be running something supposedly more
"standard" like "rdate" or "timed" instead; no smileys here, sad to say).

When I reported this problem about a month ago, one user said he had a
set of kernel patches that would fix the problem.  But he never
delivered, and he eventually confessed that he had misplaced the patches and
could offer no hope of ever being able to get them to me.

I'm willing to switch to "xntpd", but only if someone can provide me
with positive assurances that this other NTP implementation will fix the
problem.

Thanks very much for any concrete assistance anyone can provide us.

Rich Wales <wales@CS.UCLA.EDU> // UCLA Computer Science Department
3531 Boelter Hall // Los Angeles, CA 90024-1596 // +1 (213) 825-5683
"This is yet another example of how our actions have random results."

edward@TWG.COM ("Edward C. Bennett") (12/01/90)

Rich Wales writes:
>
>I'm also concerned that the clock problems in our SPARCs are creating a
>general impression around our department that NTP is flaky and unproven
>(and that perhaps we should be running something supposedly more
>"standard" like "rdate" or "timed" instead; no smileys here, sad to say).

Is timed able to keep a SPARC's clock in line? Has anyone tried this?
Maybe SPARCs are just beyond all hope...;-)

BTW, what happens on a standalone SPARC? No ntp, no timed, nothing...
how fast do they drift?
-- 
Edward C. Bennett - The other MMDF guy			edward@twg.com
The Wollongong Group					(415) 962-7252
1129 San Antonio Road, Palo Alto, CA 94303
   "He's become a growling, snarling mass of white-hot canine terror"

thorinn@DIKU.DK (Lars Henrik Mathiesen) (12/02/90)

Edward,

We run a flock of VAXen on ntp, and on those we run a jimmied timed
whose only function is to act as master for our various Suns (*).
This works fine now that we've set dosynctodr to 0 in the Sun
kernels; I just checked, and most Suns are within 25 ms of the
current timed master. The SparcStations all run very fast and will
gain about 75 ms between timed syncs (every four minutes); but as
someone suggested, we could set tick to 9998 which would probably
bring them into line.

(Before we reset dosynctodr, we'd see the SparcStation clocks slew up
to sync once every four minutes and then slew even faster (about 1
second in two!) back to an (increasing) offset of up to 20 seconds.
When the offset grew larger than that, timed would log a complaint and
do a settimeofday, starting the cycle again.)

It seems that an unloaded SparcStation with dosynctodr==0 is about 300
ppm fast (about 30 seconds a day). When dosynctodr was set, they'd
generate a timed log message every three to four hours during working
hours only (about 1500 ppm slow). I guess the slowness when loaded is due
to lost clock interrupts, although I'm unable to imagine what sort of
bogosity in SunOS it is that makes this time loss persistent when using
the time-of-day register.
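
The figures above can be cross-checked with a little arithmetic. This is
only a sketch: the helper names are invented, and pairing the 20-second
offset limit with the three-to-four-hour log interval is my reading of the
numbers, not something measured.

```python
SECONDS_PER_DAY = 86_400


def drift_ppm(offset_ms, interval_s):
    """Average drift rate in ppm for an offset accumulated over an interval."""
    return offset_ms / 1000 / interval_s * 1e6


def seconds_per_day(ppm):
    """Offset accumulated in one day at a constant drift rate."""
    return ppm * 1e-6 * SECONDS_PER_DAY


# 75 ms gained between timed syncs four minutes apart.
fast = drift_ppm(75, 4 * 60)

# 300 ppm expressed as seconds per day.
per_day = seconds_per_day(300)

# Assumed: a 20 s offset builds up over roughly 3.5 hours.
slow = drift_ppm(20_000, 3.5 * 3600)

print(f"fast: {fast:.0f} ppm, {per_day:.1f} s/day at 300 ppm, slow: {slow:.0f} ppm")
```

The first figure comes out a little over 300 ppm, 300 ppm works out to
roughly 26 seconds a day, and the last is in the neighborhood of the
1500 ppm quoted above.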

Lars
_____________________________________________________________________
(*) Patches for anonymous ftp at freja.diku.dk:misc/ntp-timed.patch .

Mills@udel.edu (12/04/90)

Rich,

The problem has been reported to be lost clock ticks due to the practice
of disabling interrupts while dirty pages are swapped to backing store,
which occurs about once every 30 seconds. Apparently, this can result
in periods up to several hundred milliseconds during which clock
interrupts are stuck. It has also been reported that the fix of choice
is to dump the System-V clock code in favor of the old 4.3bsd clock code.
This has not been verified here. Diskless clients should have no trouble,
nor should workstations that don't dirty too many pages/second. I
do not know what if anything Sun is doing about this. Our gaggle of
SPARCs keep pretty good time, but they are hardly stressed and usually
dirty only the fileserver's pages.

It is possible to widen the aperture NTP uses to distinguish clock
jitter from broken clocks. Out of the box this aperture is +-128 ms, but could
easily be made much larger. However, if a few hundred milliseconds is
being yanked from under it every 30 seconds or so, NTP is not the protocol
of choice. Run NTP on a stable platform somewhere and a bugged timed to
keep the rascals in line.
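
To put numbers on that, here is a deliberately simplified sketch. It
assumes each episode loses a fixed amount of time, that episodes occur
every 30 seconds, and that nothing corrects the clock in between; real NTP
behavior is more involved, and the function names are hypothetical.

```python
APERTURE_MS = 128      # default aperture quoted above (+-128 ms)
LOSS_INTERVAL_S = 30   # approximate time between loss episodes


def equivalent_rate_ppm(lost_ms, interval_s=LOSS_INTERVAL_S):
    """Average slow-down, in ppm, from losing lost_ms once per interval."""
    return lost_ms / 1000 / interval_s * 1e6


def episodes_until_exceeded(lost_ms, aperture_ms=APERTURE_MS):
    """Loss episodes needed before the uncorrected offset exceeds the aperture."""
    return aperture_ms // lost_ms + 1


for lost_ms in (50, 200, 300):
    rate = equivalent_rate_ppm(lost_ms)
    count = episodes_until_exceeded(lost_ms)
    print(f"{lost_ms:3d} ms lost: about {rate:.0f} ppm slow, "
          f"aperture exceeded after {count} episode(s)")
```

In this model a single loss of a few hundred milliseconds already falls
outside the default aperture, and the average rate error is in the
thousands of ppm.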

Dave

seeger@MANATEE.CIS.UFL.EDU (F. L. Charles Seeger III) (12/04/90)

+------ Mills@udel.edu wrote (Mon,  3-Dec-90, 18:11 GMT):
| 
| It is possible to widen the aperture NTP uses to distinguish clock
| jitter from broken clocks. Out of the box this aperture is +-128 ms, but could
| easily be made much larger. However, if a few hundred milliseconds is
| being yanked from under it every 30 seconds or so, NTP is not the protocol
| of choice. Run NTP on a stable platform somewhere and a bugged timed to
| keep the rascals in line.

I have patched timed so that when it is run in master mode it won't update
the system clock.  Otherwise, having ntp and a timed running on the same
machine can cause trouble.  This code has survived through one incident
where a timed with the wrong time got elected to be "master".

If anyone wants the patches, send me mail.

Chuck
--
  Charles Seeger    E301 CSE Building             Office: +1 904 392 1508
  CIS Department    University of Florida         Fax:    +1 904 392 1220
  seeger@ufl.edu    Gainesville, FL 32611-2024