[comp.sys.sun] Sparcstation time keeping woes

knutson@perseus.sw.MCC.COM (Jim Knutson) (10/02/89)

| Included Message:

X-Date:    Sat, 30 Sep 89 17:13:32 EDT
X-From:    Dennis Ferguson <dennis@gw.ccie.utoronto.ca>

I have done battle trying to run NTP on a number of Sparcstation I's.
This has turned out not to be possible to do in a reasonable manner.  The
related observations are:

(1) The clocks on sparcstation I's run obscenely fast, somewhere between
300 and 350 ppm (i.e. a drift value of somewhere between -1.2 and -1.4).
This is too far off for NTP to capture well, since even if NTP steps the
clock to on time it will have drifted 200 ms or so before sys.hold
expires, which is outside the loop filter's aperture, and the clock will
have to be stepped again.  This can be repaired, however, by estimating
the drift by hand and initializing the drift file to this value.

(2) With an error this large the clock should be gaining 25 or 30 seconds
per day if left to itself.  The clock doesn't gain anywhere near this
much, however.  In fact, the ones we have run a second or two per day off,
some fast and some slow.  This would indicate that something must be
setting the clock back from time to time.  It is indeed the case that
something in the kernel is making the clock jump around.  The ntp daemon
will hold the time just fine for a while, and then all of a sudden the
time will change underneath it by increments which are usually less than a
second, but sometimes more than two seconds.
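
The figures in (1) and (2) follow directly from the quoted frequency error. The sketch below is only that arithmetic; the function names are illustrative, and the 200 ms offset is simply the drift figure quoted in (1), not a documented NTP constant.

```python
SECONDS_PER_DAY = 86_400

def gain_per_day(ppm):
    """Seconds gained per day by a clock running `ppm` parts per million fast."""
    return ppm * 1e-6 * SECONDS_PER_DAY

def seconds_to_drift(offset_s, ppm):
    """Elapsed seconds before a clock that is `ppm` fast has gained `offset_s`."""
    return offset_s / (ppm * 1e-6)

for ppm in (300, 350):
    print(f"{ppm} ppm: {gain_per_day(ppm):.2f} s/day, "
          f"200 ms after {seconds_to_drift(0.2, ppm) / 60:.1f} min")
```

At 300 to 350 ppm this works out to roughly 26 to 30 seconds per day, with 200 ms accumulating in about ten minutes.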

This is what I see.  Much of the rest is utter speculation since we
have no source for this operating system.

(1) Sun has produced a machine with a clock interrupt timer which is
incapable of producing interrupts which are an exact integral fraction of
a second given the frequency of the oscillator driving it.  This is
unfortunate since the value of hz is an integer.  On the sparcstation hz
is defined as 100, but should probably be something more like 100.03.

(2) The value of tick, the number of microseconds added to the time on
each clock interrupt, is computed as 1000000/hz.  This makes the clock run
fast.

(3) There is a battery backed up time-of-year clock in the sparcstation
which has a precision of about a second or so.  The time-of-year clock is
reset when you call settimeofday(), but nothing is done to it when you
call adjtime() since the precision is too crude (I have no idea whether
this is true or not.  It is a guess at how one might produce the symptoms
that are seen).

(4) Note that the clock was made to run fast in (2), and if you sell
machines with clocks which gain half a minute a day some people will
probably complain (??).  What should have been done to repair this is to
fix (2), by setting tick to a value which reflects the actual interrupt
interval of the timer (my guess is about 9997).  This would have allowed
them to get the clock speed to within about 50 ppm without further
complication, and this is adequately respectable by Sun standards.  There
is no reason that tick has to be 1000000/hz that I can see.

(5) Unfortunately, (4) was too easy.  Instead I suspect that some bright
light discovered that, while the interrupt timer might gain half a minute
a day, the time-of-year clock was good for a couple of seconds a day, so
all you had to do was keep the system time in line with the time-of-year
clock and everything would be fine.  Of course the time-of-year clock is
crude, but who worries about the stuff below a second anyway?  Just keep
comparing the time until the truncated system clock value exceeds the time
of year clock by more than a second or so, then step it back.  Note there
is a truncation involved in this comparison.  This may explain why the
steps backwards are sort of randomly sized.
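
The numbers in (1), (2) and (4) hang together as follows. This is only a sketch of the arithmetic, assuming the timer really does interrupt at about 100.03 Hz as guessed in (1); the function names are illustrative and nothing here is taken from the SunOS kernel.

```python
def ideal_tick_us(actual_hz):
    """Microseconds that really elapse between clock interrupts."""
    return 1_000_000 / actual_hz

def clock_error_ppm(tick_us, actual_hz):
    """Rate error in ppm (positive = fast) when `tick_us` is added per interrupt."""
    credited_per_second = tick_us * actual_hz   # microseconds credited per real second
    return credited_per_second - 1_000_000      # 1 microsecond per second is 1 ppm

ACTUAL_HZ = 100.03   # assumed, from the guess in (1)

print(round(ideal_tick_us(ACTUAL_HZ)))             # 9997
print(round(clock_error_ppm(10_000, ACTUAL_HZ)))   # 300: tick = 1000000/hz runs fast
print(round(clock_error_ppm(9_997, ACTUAL_HZ)))    # 0: nearest integer tick
```

Because tick is a whole number of microseconds, each unit of tick changes the rate by about 100 ppm at 100 interrupts per second, so the best integer choice can still leave a residual of up to about 50 ppm.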

This corresponds to what I see.  Something in the kernel on sparcstations
keeps stepping the clock backwards.  This insistence that the time-of-year
clock is more accurate than time keeping software (if that's what it is)
makes trying to synchronize these things futile.  This, of course, would
also break timed, which may or may not be why Sun doesn't ship it any
more.  And it makes the adjtime() call nearly useless.
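
To make the guessed mechanism concrete, here is a toy model of it. Everything in it is an assumption for illustration: a system clock running 300 ppm fast, a perfectly accurate time-of-year clock that reads whole seconds only, one comparison per real second, and a step back whenever the truncated system time is ahead by a second. It is not derived from the actual kernel, and it does not attempt to reproduce the irregular step sizes that are observed.

```python
US = 1_000_000  # microseconds per second

def simulate(ppm_fast, seconds, threshold_s=1):
    """Return (list of backward steps in microseconds, final offset in microseconds)."""
    sys_us = 0
    steps = []
    for t in range(1, seconds + 1):
        sys_us += US + ppm_fast          # fast clock gains ppm_fast microseconds per real second
        toy_s = t                        # hypothetical TOY clock: whole seconds, assumed accurate
        if sys_us // US - toy_s >= threshold_s:
            steps.append(sys_us - toy_s * US)
            sys_us = toy_s * US          # step the system clock back onto the TOY clock
    return steps, sys_us - seconds * US

steps, final_offset = simulate(300, 86_400)
print(len(steps), steps[0], final_offset)   # 25 1000200 915000
```

In this model the clock never gets more than about a second ahead over a day, yet it is stepped back roughly once an hour, which is the kind of sawtooth that would defeat anything trying to discipline the clock with adjtime().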

I can fix the value of tick so that the system clock keeps better time by
myself.  What I haven't been able to do is figure out how to turn off
whatever it is in the kernel which keeps bumping the clock.  I would be
very grateful if someone could tell me how to do this on a binary system.

Dennis