wales@CS.UCLA.EDU (Rich Wales) (11/30/90)
About a month ago, I reported that we (UCLA CS Department) had been having a very nasty time trying to keep our SPARCstations (SPARC-1's and SPARC-SLC's, running SunOS 4.0.3) time-synched via NTP. I've already done the following:

(1) Everyone is running "in.ntpd", May '89 version, patch level 13.
(2) "_dosynctodr" has been set to zero in all kernels.
(3) "_tick" has been changed from 10,000 to 9,998 in all kernels (one person on the NTP list suggested this, and it seems to have helped).

But the problem still recurs. Especially on the SLC's, but sometimes on the SPARC-1's too, a clock may lose over an hour before someone finally sees the problem, reports it to our "help" mailing list (system support staff), and the clock gets reset by hand. Our users usually notice the problem because the NTP daemon starts complaining incessantly that the clock is too far off for NTP to deal with. Our five Sun-4/380's do not suffer from this problem at all, by the way.

I'm aware that the SPARCs have a clock problem, but I had been hoping that NTP might be able to keep it under control. Even if we could only keep our SPARCs to within a second or so of real time, I'd be willing to lower my standards :-} and accept that.

I'm also concerned that the clock problems in our SPARCs are creating a general impression around our department that NTP is flaky and unproven (and that perhaps we should be running something supposedly more "standard" like "rdate" or "timed" instead; no smileys here, sad to say).

When I reported this problem about a month ago, one user said he had a set of kernel patches that would fix the problem. But he never delivered, and he eventually confessed that he had misplaced the patches and could offer no hope of ever getting them to me.

I'm willing to switch to "xntpd", but only if someone can provide me with positive assurance that this other NTP implementation will fix the problem.
Thanks very much for any concrete assistance anyone can provide us.

Rich Wales <wales@CS.UCLA.EDU>
UCLA Computer Science Department
3531 Boelter Hall
Los Angeles, CA 90024-1596
+1 (213) 825-5683
"This is yet another example of how our actions have random results."
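The "_tick" change in item (3) can be put into numbers. Since the nominal tick is 10,000 µs, the clock interrupt presumably runs at 100 Hz, and `tick` is the number of microseconds added to the software clock on each interrupt. Adding 9,998 µs per 10 ms of real time makes the clock run 2/10,000 slower, i.e. 200 ppm. The Python sketch below does only that arithmetic; the function names are hypothetical, and it assumes no clock interrupts are lost (which, as the later replies suggest, is exactly what does not hold on a busy SPARC).

```python
SECONDS_PER_DAY = 86_400


def tick_rate_change_ppm(nominal_tick_us: int, new_tick_us: int) -> float:
    """Rate change (ppm) from adding new_tick_us instead of nominal_tick_us
    microseconds on every clock interrupt.

    Negative means the software clock now runs slower than before.
    Assumes every clock interrupt is actually delivered.
    """
    return (new_tick_us - nominal_tick_us) * 1e6 / nominal_tick_us


def drift_seconds_per_day(ppm: float) -> float:
    """Seconds gained (positive) or lost (negative) per day at a given ppm."""
    return ppm * SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    change = tick_rate_change_ppm(10_000, 9_998)
    print(f"rate change: {change:.0f} ppm")
    print(f"per day:     {drift_seconds_per_day(change):.2f} s")
```

Run as a script, it reports a rate change of -200 ppm, or -17.28 s per day: enough to cancel a modest fast oscillator, but nowhere near enough to explain, or cure, a clock that loses an hour.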
edward@TWG.COM ("Edward C. Bennett") (12/01/90)
Rich Wales writes:
>
>I'm also concerned that the clock problems in our SPARCs are creating a
>general impression around our department that NTP is flaky and unproven
>(and that perhaps we should be running something supposedly more "standard"
>like "rdate" or "timed" instead; no smileys here, sad to say).

Is timed able to keep a SPARC's clock in line? Has anyone tried this? Maybe SPARCs are just beyond all hope... ;-)

BTW, what happens on a standalone SPARC? No ntp, no timed, nothing... how fast do they drift?
--
Edward C. Bennett - The other MMDF guy    edward@twg.com
The Wollongong Group                      (415) 962-7252
1129 San Antonio Road, Palo Alto, CA 94303
"He's become a growling, snarling mass of white-hot canine terror"
thorinn@DIKU.DK (Lars Henrik Mathiesen) (12/02/90)
Edward,

We run a flock of VAXen on ntp, and on those we run a jimmied timed whose only function is to act as master for our various Suns (*). This works fine now that we've set dosynctodr to 0 in the Sun kernels; I just checked, and most Suns are within 25 ms of the current timed master. The SparcStations all run very fast and will gain about 75 ms between timed syncs (every four minutes); but as someone suggested, we could set tick to 9998, which would probably bring them into line.

(Before we reset dosynctodr, we'd see the SparcStation clocks slew up to sync once every four minutes and then slew even faster (about 1 second in two!) back to an (increasing) offset of up to 20 seconds. When the offset grew larger than that, timed would log a complaint and do a settimeofday, starting the cycle again.)

It seems that an unloaded SparcStation with dosynctodr==0 is about 300 ppm fast (about 30 seconds a day). When dosynctodr was set, they'd generate a timed log message every three to four hours during working hours only (about 1500 ppm slow). I guess the slowness when loaded is due to lost clock interrupts, although I'm unable to imagine what sort of bogosity in SunOS it is that makes this time loss persistent when using the time-of-day register.

Lars
_____________________________________________________________________
(*) Patches for anonymous ftp at freja.diku.dk:misc/ntp-timed.patch.
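Lars's figures can be cross-checked with simple rate arithmetic. A gain of 75 ms over a four-minute (240 s) timed interval is 0.075 / 240, about 312 ppm, which matches his "about 300 ppm fast"; 300 ppm works out to roughly 26 seconds a day. If setting tick to 9998 shifts the rate by exactly -200 ppm (assuming a 100 Hz clock with a 10,000 µs nominal tick and no lost interrupts), a residual of roughly 112 ppm would remain. The Python sketch below does only this bookkeeping under a constant-rate assumption; the helper names are hypothetical, and it says nothing about the load-dependent tick loss he describes.

```python
def offset_to_ppm(offset_s: float, interval_s: float) -> float:
    """Frequency error (ppm) implied by an offset accumulated over an interval."""
    return offset_s / interval_s * 1e6


def ppm_to_seconds(ppm: float, interval_s: float) -> float:
    """Offset (s) accumulated over interval_s at a constant error of ppm."""
    return ppm * interval_s / 1e6


if __name__ == "__main__":
    observed = offset_to_ppm(0.075, 240)          # 75 ms gained per 4-minute sync
    tick_shift = (9_998 - 10_000) * 1e6 / 10_000  # -200 ppm, assuming 100 Hz
    residual = observed + tick_shift
    print(f"observed:       {observed:.1f} ppm")
    print(f"per day at 300: {ppm_to_seconds(300, 86_400):.1f} s")
    print(f"residual:       {residual:.1f} ppm "
          f"({ppm_to_seconds(residual, 240) * 1e3:.0f} ms per 4 min)")
    print(f"1500 ppm slow:  {ppm_to_seconds(-1500, 86_400):.1f} s/day")
```

Run as a script, it reports about 312.5 ppm observed, 25.9 s/day at 300 ppm, a residual of about 112.5 ppm (27 ms per four-minute sync interval), and -129.6 s/day for the 1500 ppm loaded case. So tick=9998 would narrow the unloaded error but not remove it, and it would make the loaded case worse.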
Mills@udel.edu (12/04/90)
Rich,

The problem has been reported to be lost clock ticks due to the practice of disabling interrupts while dirty pages are swapped to backing store, which occurs about once every 30 seconds. Apparently, this can result in periods of up to several hundred milliseconds during which clock interrupts are stuck. It has also been reported that the fix of choice is to dump the System-V clock code in favor of the old 4.3bsd clock code. This has not been verified here. Diskless clients should have no trouble, nor should workstations that don't dirty too many pages/second. I do not know what, if anything, Sun is doing about this. Our gaggle of SPARCs keep pretty good time, but they are hardly stressed and usually dirty only the fileserver's pages.

It is possible to widen the aperture NTP uses to distinguish clock jitter from broken clocks. Out of the box this aperture is +-128 ms, but could easily be made much larger. However, if a few hundred milliseconds is being yanked from under it every 30 seconds or so, NTP is not the protocol of choice. Run NTP on a stable platform somewhere and a bugged timed to keep the rascals in line.

Dave
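Dave's point about the aperture can be illustrated with a deliberately simplified rule: an offset within +-128 ms is treated as jitter to be slewed out, anything larger as a broken clock. This is not the actual in.ntpd or xntpd code; the function names and the 300 ms loss per 30 s burst are hypothetical values chosen from his "a few hundred milliseconds every 30 seconds or so". The sketch also converts such a loss into an average rate, which shows why no plausible frequency correction keeps up.

```python
APERTURE_S = 0.128  # +-128 ms, the out-of-the-box aperture Dave mentions


def classify(offset_s: float, aperture_s: float = APERTURE_S) -> str:
    """Simplified rule: offsets inside the aperture count as jitter to be
    slewed out gradually; anything beyond it is treated as a broken clock."""
    return "jitter" if abs(offset_s) <= aperture_s else "broken"


def lost_tick_rate_ppm(lost_s: float, period_s: float) -> float:
    """Average rate error caused by losing lost_s of time every period_s."""
    return lost_s / period_s * 1e6


if __name__ == "__main__":
    # Hypothetical: 300 ms of clock interrupts lost in each 30 s pageout burst.
    lost = 0.300
    print(classify(0.010))                  # ordinary network jitter
    print(classify(-lost))                  # one burst of lost ticks
    print(f"{lost_tick_rate_ppm(lost, 30):.0f} ppm slow on average")
    print(classify(-lost, aperture_s=0.5))  # a wider aperture accepts it
```

Run as a script, it prints "jitter", "broken", "10000 ppm slow on average", and "jitter". Widening the aperture, as the last call shows, only changes how the offset is classified; the clock keeps losing ticks at a rate far beyond the few hundred ppm discussed above, which is why Dave suggests keeping NTP on a stable platform instead.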
seeger@MANATEE.CIS.UFL.EDU (F. L. Charles Seeger III) (12/04/90)
+------ Mills@udel.edu wrote (Mon, 3-Dec-90, 18:11 GMT):
|
| It is possible to widen the aperture NTP uses to distinguish clock
| jitter from broken clocks. Out of the box this aperture is +-128 ms, but
| could easily be made much larger. However, if a few hundred milliseconds
| is being yanked from under it every 30 seconds or so, NTP is not the
| protocol of choice. Run NTP on a stable platform somewhere and a bugged
| timed to keep the rascals in line.

I have patched timed so that when it is run in master mode it won't update the system clock. Otherwise, having ntp and timed running on the same machine can cause trouble. This code has survived one incident where a timed with the wrong time got elected to be "master". If anyone wants the patches, send me mail.

Chuck
--
Charles Seeger       E301 CSE Building            Office: +1 904 392 1508
CIS Department       University of Florida        Fax:    +1 904 392 1220
seeger@ufl.edu       Gainesville, FL 32611-2024