maree@cs.uq.oz.au (03/11/91)
After Russell Mosemann's posting last week (Re: Previous time adjustment):

> The error message is printed when xntpd does one of its every-4-seconds
> adjustments and then finds that not all of the previous adjustment had
> been done yet.
> I talked to Sun and found out that an early version of SUNOS 4.1.1
> had a bug in adjtime() which would not complete adjustments under
> certain circumstances. For me, it happened when one of the programs was
> giving the system a run for the money.
> Anyway, the part number of my tape is 700-2687-10 Rev. A. The next
> revision should have the fix in it. The bug number is 1036401, in case
> anyone is interested.

I pursued the problem with Sun. Thanks to Sun (both Australia and U.S.)
for a very fast response.

The bug report (#1036401) refers to adjtime() and date -a being
ineffective. The bug is fixed in 4.1.1; there is no patch.

The tickadj bug (details at end of this posting) explains Alan Young's
problem:

> Subject: ntp & Sparcstation running 4.0.1 -- loss of sync
> From: Alan Young <awy%concurrent.co.uk@RELAY.CS.NET>
> Running ntpd (manual page dated 15 June 1989) on a Sun4 sparcstation
> running SunOS 4.0.1c I can get in sync with another host and get the
> offset down to something reasonable (< 100 ms) with a dispersion in the
> thousands. Then the dispersion suddenly goes out to 64000 and we are
> out of sync for a couple of minutes. This cycle repeats about every 5
> minutes. The offset often goes out to as much as 5s. Someone suggested
> that this may be a known problem (with solution): any ideas?
>

But.... it appears that the "Previous time adjustment didn't complete"
"bug" has nothing to do with the tickadj bug.

The following and the bug report below are reprinted with permission.

------------------------
Bug Id:      1053351
Category:    kernel
Subcategory: syscall
Bug/Rfe:     bug
State:       closed
Synopsis:    adjtime() does not work well with xntpd

This is not a kernel or syscall bug.
The behavior of adjtime(2) in this case is consistent with the
definition of its proper behavior as described in the adjtime(2) man
page:

     The adjustment is effected by speeding up (if that amount of
     time is positive) or slowing down (if that amount of time is
     negative) the system's clock by some small percentage,
     generally a fraction of one percent. Thus, the time is always
     a monotonically increasing function. A time correction from
     an earlier call to adjtime() may not be finished when
     adjtime() is called again. If olddelta is not a NULL pointer,
     then the structure it points to will contain, upon return, the
     number of microseconds still to be corrected from the earlier
     call. If olddelta is a NULL pointer, the corresponding
     information will not be returned.

The error message noted in the bug report is generated in
xntpd/ntp_unixclock.c:

        if (adjtime(&adjtv, &oadjtv) != 0)
                syslog(LOG_ERR, "Can't do time adjustment: %m");
        if (oadjtv.tv_sec != 0 || oadjtv.tv_usec != 0) {
                syslog(LOG_ERR, "Previous time adjustment didn't complete");

The "error" case is, in fact, a perfectly legitimate result of the
adjtime() call -- a previous adjustment hasn't had time to complete.
The program should not assume that the previous time adjustment has
completed.
---------------------

Here's the bug report:

Category:        kernel
Subcategory:     syscall
Release summary: 4.1, 4.0.3, 4.1_psr_a, 4.0, 4.1.1-alpha2
Bug/Rfe:         bug
State:           closed
Synopsis:        adjtime() and date -a are ineffective
Keywords:        ineffectiv, -a, date, adjtime(), adjtime
Severity:        2
Priority:        3

Description:

When the system time is adjusted via adjtime() (as invoked by date -a),
the time changes to the correct time. However, within a minute, the
system automatically generates an opposite adjustment. This occurs
since hardclock fails to ever call doresettodr() to set the rtc and
when synctodr() is called, the opposite adjustment pushes the system
time back.

The description field as copied from bug report 1038434 follows:

When the system clock is changed via an "adjtime" system call, the
contents of the TOD chip are never modified to reflect the adjusted
time; at the next reboot, the adjustment disappears.

The description field as copied from bug report 1045448 follows:

When using date -a, the 4.1 server increments the date to the required
level. Then the date decreases to its old level. It seems to work on
4.1.1 Beta, but I was not able to test it on a 4/[34]90 as there is no
4.1.1 PSRA Beta. Setting the date with date works as it should.

The description field as copied from bug report 1045516 follows:

Customer tries to use the adjtime(2) system call to synchronize the
time between several machines each night. If he determines that a
machine is 20 seconds slow, he tries to move it forward 20 seconds.
This takes a couple of minutes and does work. Some time later, the
customer notices that the time is, once again, 20 seconds slow.

He has been able to determine that "something" in the system (after the
20 second advance has worked) is plugging negative values into the
adjtime(2) call and moving the clock back to its original value. The
negative values being put in are in the -2 to -3 range each time.

He was seeing this by putting in 20, waiting a period of time, and then
putting 0 into adjtime. This puts 0 into the register, and returns the
value that was there already. The return value at increasing times
returns decreasing values as they approach zero, then they start
becoming negative values.

The same kind of thing happens if he tries to adjust the time
backwards: it works, and then works its way forward again.

Work around:

Setting the date rather than adjusting does work. However, this causes
either time gaps or repetition of time intervals.

The work around field as copied from bug report 1038434 follows:

When a permanent change to the system time is required, set the time
explicitly with date or synchronize to a server with rdate.

If synchronizing the system clock to an external standard, as when
using NTP, the logic for slaving the software time to the TOD chip time
should be disabled:

        # adb -w /vmunix
        dosynctodr?W 0
        $q
        #

The work around field as copied from bug report 1045448 follows:

Use date instead of date -a.

The work around field as copied from bug report 1045516 follows:

Use the date command, but this will disrupt the time continuum on the
system.

Suggested fix:

The suggested fix field as copied from bug report 1038434 follows:

Repair the logic for calling resettodr in kern_clock.c

State triggers:

Evaluation:

Yup. The problem is true as stated:

        if (timedelta == 0) {
                BUMPTIME(&time, tick);
        } else {
                register delta;

                if (timedelta < 0) {
                        delta = tick - tickdelta;
                        timedelta += tickdelta;
                } else {
                        delta = tick + tickdelta;
                        timedelta -= tickdelta;
                }
                BUMPTIME(&time, delta);
                if (-tickdelta < timedelta && timedelta < tickdelta) {
                        timedelta = 0;
                        if (doresettodr) {
                                if (doresettodr == 1)
                                        doresettodr = time.tv_sec;
                                if (doresettodr != time.tv_sec) {
                                        doresettodr = 0;
                                        resettodr();
                                }
                        }
                }
        }

When the timedelta drops enuf to be zeroed, then doresettodr is set to
cause the clock chip to be reset at the next second tick.
Unfortunately, that is never reached. The next time that synctodr()
runs after timedelta gets zeroed, the time is adjusted back to where it
would have been without the adjtime(2) call.

The evaluation field as copied from bug report 1038434 follows:

14may90 limes -- first, thanx to steve chessin for finding this.

in sys/os/kern_clock.c:hardclock() ...

When a correction has been applied via adjtime() [timedelta is nonzero
and doresettodr is nonzero], and the correction has come to its closest
point, the current time is remembered so we can set the chip clock just
as we tick over the next second.
The code that notices that we are ticking over the next second is not
in fact ever reached, as it is contained inside the conditional for
"timedelta != 0", and we have already cleared the timedelta value.

The evaluation field as copied from bug report 1045448 follows:

We believe this to be a duplicate of 1036401, which has been fixed in
4.1.1.

Commit to fix in releases: 4.1.1-beta1
Fixed in releases:         4.1.1-beta1
Integrated in releases:    4.1.1-beta1
Verified in releases:      4.1.1-beta2
Closed because:            fixed verified

Public Summary:

The adjtime() function and date(1) -a option are only temporarily
effective, and the system immediately undoes the adjustment when the
system clock becomes correct.

Hook 2:
Needs investigation in release:
Bug End:

------------------------------------------------------------------
Maree Hegarty                         maree@cs.uq.oz.au
Computer Science, University of Queensland, 4072, Australia
Ph: +61 7 365 2864                    Fax: +61 7 365 1999
-------------------------------------------------------------------
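
The unreachable reset described in the evaluation can be reproduced with a small user-space model. What follows is only a sketch under stated assumptions, not SunOS source: the struct, the function names, the 100 Hz tick and the 5 microsecond tickdelta are all invented for illustration, the model assumes adjtime() arms doresettodr by setting it to 1, and the "patched" placement of the check is a guess at what "repair the logic for calling resettodr" might mean, since the actual fix is not shown in the report.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only: a 100 Hz clock slewed by 5 us per tick. */
enum { TICK_USEC = 10000, TICKDELTA_USEC = 5 };

/* Hypothetical stand-in for the kernel variables in the quoted excerpt. */
struct clock_model {
    long long time_usec;    /* software clock */
    long long timedelta;    /* correction still owed to adjtime() */
    long long doresettodr;  /* 0 idle, 1 armed, else second in which the slew ended */
    int tod_resets;         /* counts what would be calls to resettodr() */
};

/* Same two-step test as the quoted code: remember the current second,
 * then rewrite the TOD chip once the second has rolled over. */
void check_resettodr(struct clock_model *c)
{
    long long sec = c->time_usec / 1000000;

    if (c->doresettodr) {
        if (c->doresettodr == 1)
            c->doresettodr = sec;
        if (c->doresettodr != sec) {
            c->doresettodr = 0;
            c->tod_resets++;
        }
    }
}

void hardclock_model(struct clock_model *c, bool patched)
{
    if (c->timedelta == 0) {
        c->time_usec += TICK_USEC;
    } else {
        long long delta;

        if (c->timedelta < 0) {
            delta = TICK_USEC - TICKDELTA_USEC;
            c->timedelta += TICKDELTA_USEC;
        } else {
            delta = TICK_USEC + TICKDELTA_USEC;
            c->timedelta -= TICKDELTA_USEC;
        }
        c->time_usec += delta;
        if (-TICKDELTA_USEC < c->timedelta && c->timedelta < TICKDELTA_USEC) {
            c->timedelta = 0;
            if (!patched)
                check_resettodr(c);  /* quoted placement: reached on this tick only */
        }
    }
    if (patched && c->timedelta == 0)
        check_resettodr(c);          /* assumed repair: also reached on later ticks */
}

/* Run the model for `ticks` clock interrupts after an adjtime() of
 * delta_usec and report how often the TOD chip would have been rewritten. */
int tod_resets_after_adjtime(long long delta_usec, long ticks, bool patched)
{
    /* Start at second 1000; doresettodr = 1 models the assumed arming by adjtime(). */
    struct clock_model c = { 1000LL * 1000000LL, delta_usec, 1, 0 };

    assert(ticks >= 0);
    for (long i = 0; i < ticks; i++)
        hardclock_model(&c, patched);
    return c.tod_resets;
}
```

In this model a 20 ms correction followed by 6000 ticks never rewrites the chip with the quoted placement, and rewrites it exactly once with the assumed repair, which matches the report's observation that the second-rollover test sits inside a branch that is no longer entered once timedelta has been cleared.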
mouse@LIGHTNING.MCRCIM.MCGILL.EDU (der Mouse) (03/12/91)
>> The "error" case is, in fact, a perfectly legitimate result of the
>> adjtime() call -- a previous adjustment hasn't had time to complete.
>> The program should not assume that the previous time adjustment has
>> completed.

> This, of course, completely misses the point. The program has
> carefully crafted the adjtime() values so that the adjustment should
> have completed by the next time adjtime() is called.

Only because it assumes something about the way adjtime is implemented
(the tickadj value and the way it's used).

> The problem with SunOS is that someone else (the kernel) is doing
> adjtime()s behind your back.

Right.

>> If synchronizing the system clock to an external standard, as
>> when using NTP, the logic for slaving the software time to the
>> TOD chip time should be disabled:
>>         # adb -w /vmunix
>>         dosynctodr?W 0
>>         $q
>>         #

> Again, this misses the point.

It fixes the problem mentioned above - that the kernel is re-skewing
the software time to match the hardware time. It should deal with the
pseudo-adjtime you mentioned above.

To fix it right, of course, the hardware clock should be reset
correctly when adjtime() is used.

> The real problem is that clock interrupts are being lost.

This is a separate problem, not related to the kernel effectively doing
adjtime()s behind ntp's back. (Not related except perhaps for their
being part of the same timekeeping implementation, that is.) The
problem here is that some things (notably output to the default tty
emulator for the console) tend to lock out clock interrupts for
excessively long intervals. (And I must admit, given how fast a SPARC
can blit things around, scrolling is *incredibly* slow. Does it handle
each pixel separately or what?!)

                                        der Mouse

                        old: mcgill-vision!mouse
                        new: mouse@larry.mcrcim.mcgill.edu
bww+@K.GP.CS.CMU.EDU (Bradley White) (03/12/91)
>> The program has
>> carefully crafted the adjtime() values so that the adjustment should
>> have completed by the next time adjtime() is called.

> Only because it assumes something about the way adjtime is implemented
> (the tickadj value and the way it's used).

Just as it must in order to do the best job of slewing the time
(without some further support like, for example, adjtime2()).

>> The real problem is that clock interrupts are being lost.

> This is a separate problem

If you are losing interrupts with any regularity, forget about using NTP
to synchronize clocks---it will be continually confused by the shifting
frequency of the local clock. Something coarser will suffice.

> The
> problem here is that some things (notably output to the default tty
> emulator for the console) tend to lock out clock interrupts for
> excessively long intervals.

I assert that any event that takes longer than the time between two
clock interrupts to process should not be running at a priority higher
than the clock.

Bradley
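
The dependence on tickadj discussed in this exchange comes down to simple arithmetic. The sketch below is an assumption-laden illustration, not xntpd or SunOS source: it supposes the kernel moves the clock by tickadj microseconds on each of hz ticks per second, and the function names and the example values (100 Hz, 5 microseconds, a 4-second interval) are invented for the purpose.

```c
#include <assert.h>

/* Largest correction the kernel could slew between two adjtime() calls,
 * assuming it shifts the clock by tickadj_usec on every one of hz ticks
 * per second (illustrative model only). */
long max_slew_usec(long hz, long tickadj_usec, long interval_sec)
{
    assert(hz > 0 && tickadj_usec > 0 && interval_sec > 0);
    return hz * tickadj_usec * interval_sec;
}

/* Limit a requested adjustment to what can complete within one interval,
 * keeping its sign. */
long clamp_adjustment(long wanted_usec, long limit_usec)
{
    assert(limit_usec >= 0);
    if (wanted_usec > limit_usec)
        return limit_usec;
    if (wanted_usec < -limit_usec)
        return -limit_usec;
    return wanted_usec;
}
```

With those example values the limit is 100 * 5 * 4 = 2000 microseconds per interval, so a caller that never asks for more than that should find nothing left over at the next call, provided nothing else (such as the kernel's own TOD resynchronization) is adjusting the clock in between.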
rbthomas@frogpond.rutgers.edu (Rick Thomas) (03/26/91)
In article <12536@goofy.Apple.COM> kerlyn@apple.com (Kerry Lynn) writes

| In article <13990.668752214@K.GP.CS.CMU.EDU> bww+@K.GP.CS.CMU.EDU (Bradley
| White) writes:
| > If you are losing interrupts with any regularity, forget about using NTP
| > to synchronize clocks---it will be continually confused by the shifting
| > frequency of the local clock. Something coarser will suffice.
|
| Is this statement accurate? If so, wouldn't this be an "implementation
| issue" of the first rank? Can such a situation be mitigated by lowering
| the advertised precision of one's clock or by using "mains" parameters
| for the PLL? I'd really appreciate hearing from people who've faced
| this problem and solved it.

Don't worry, that statement is actually not accurate. Indeed, lost
interrupts are one reason why xntpd updates its drift and compliance
estimates based only on 'slew's and not on 'step's. If the clock gets
too far out of whack (as it does on Sparcstations when interrupts are
getting dropped a lot) xntpd just does a "step" adjustment and starts
over from scratch. If the clock is only out by enough to fit within
the "window" of slewing (128 ms usually, but it can be configured all
the way up to 499 ms without having to make major code changes) then
xntpd does the appropriate adjtime and also makes adjustments to its
own idea of how much the clock is drifting.

Experience shows that this actually copes quite well with the situation
encountered in Sparcstations (and other machines too). When interrupts
are getting dropped at a large (hence unpredictable) rate, the clock
advances by a series of discrete steps but stays on track as well as
can be expected under the circumstances; when things are going well,
and interrupts are not being dropped (or are being dropped only
infrequently), then the algorithm can get a good handle on the actual
amount of drift in the hardware clock, and it does. It is this drift
estimate which is used later to correct for clock drift every 4
seconds.
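
The slew-or-step choice described above can be written down in a few lines. This is a minimal sketch assuming only what the paragraph says: the type and function names are hypothetical, this is not xntpd source, and the treatment of an offset exactly equal to the window (slewed here) is an arbitrary choice the text does not settle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum clock_action { CLOCK_SLEW, CLOCK_STEP };

/* Hypothetical result type: what to do with the clock, and whether this
 * offset should feed the drift estimate. */
struct decision {
    enum clock_action action;
    bool update_drift;
};

/* Offsets inside the window are slewed with adjtime() and feed the drift
 * estimate; larger offsets are stepped and leave the estimate alone. */
struct decision decide(long offset_usec, long window_usec)
{
    struct decision d;

    assert(window_usec > 0);
    if (labs(offset_usec) > window_usec) {
        d.action = CLOCK_STEP;
        d.update_drift = false;
    } else {
        d.action = CLOCK_SLEW;
        d.update_drift = true;
    }
    return d;
}
```

With the usual 128 ms window (128000 microseconds), an offset of 300 ms in either direction is stepped and ignored for drift purposes; with the window widened to 499 ms the same offset would be slewed instead.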

WARNING

The following is based mostly on rumor and surmise. Will somebody who
knows for sure please speak up?

What actually seems to be happening on Sparcstations is that the S-bus
standard specifies all S-bus peripherals will have a minimal handler in
ROM on the S-bus card. (Do you think they learned this trick from the
Apple II by any chance?) That handler is not written in SPARC
assembler as you would naturally assume, but rather in FORTH, so that
the same card can be used on any machine that provides an S-bus,
regardless of the host CPU type. One of the specifications of the
S-bus is that the host machine will provide a FORTH interpreter in its
own boot-ROM for use in driving S-bus cards.

Now Sun, in their wisdom, seem to feel that nobody would *ever* want to
run for an extended period of time directly on the console without
using a windowing system, so to avoid having to write a console driver
in C (or SPARC assembler) for every possible different kind of console,
they just use the single standard FORTH driver interface. This
accounts for the fact that Sparcstation consoles without windowing
programs are so Gawd-awfully slow (2400 baud seems an appropriate
comparison) and (because the FORTH interpreter is run with interrupts
turned off -- who knows why?) it also accounts for the dropped clock
interrupts.

Enjoy!

Rick