[comp.protocols.tcp-ip] WARNING: TOD clock not initialized -- CHECK AND RESET THE DATE!

dupuy@westend.columbia.edu (Alexander Dupuy) (01/01/88)

Ever since the leap second (23:59:60 GMT Dec 31, 1987) the realtime clocks on
Sun-3s have been behaving strangely.  When booting /vmunix, just after the
message about using nn buffers, the kernel prints out a little message like the
above.  That's not too bothersome for us, since we use rdate and ntpd to keep
our Suns' clocks in synch anyhow.

What is bothersome is that the system clocks have started to slew wildly.
Using a little program I hacked up, I have found that there are spurious deltas
showing up in adjtime(2) on *ANY SUN-3* which has had its time set or adjusted
since the leap second.  Running the following program a few times:

adjtime.c
---------
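#include <stdio.h>
#include <sys/time.h>

/* A zero delta asks for no new adjustment; olddelta receives whatever
   correction the kernel still had outstanding. */
struct timeval delta = { 0, 0 },
	    olddelta = { 0, 0 };

int
main ()
{
	if ( adjtime (&delta, &olddelta) == -1)
		perror ("adjtime");

	printf ("adjust %ld.%ld, oldslew %ld.%ld\n",
		(long) delta.tv_sec, (long) delta.tv_usec,
		(long) olddelta.tv_sec, (long) olddelta.tv_usec);
	return 0;
}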

I get results like this:

Script started on Fri Jan  1 02:14:18 1988

finest# alias adj '/src/local/local/netdate/adjtime; date'
finest# adj
adjust 0.0, oldslew -1490.-143408
Fri Jan  1 02:15:26 EST 1988
westend# adj
adjust 0.0, oldslew 0.0
Fri Jan  1 02:15:32 EST 1988
westend# adj
adjust 0.0, oldslew -1729.-961408
Fri Jan  1 02:15:46 EST 1988
westend# 

westend# date 8712311650
Thu Dec 31 16:50:00 EST 1987
westend# adj
adjust 0.0, oldslew 0.0
Thu Dec 31 16:50:09 EST 1987
westend# adj
adjust 0.0, oldslew 0.0
Thu Dec 31 16:50:29 EST 1987
westend# adj
adjust 0.0, oldslew 0.0
Thu Dec 31 16:50:48 EST 1987

script done on Thu Dec 31 16:50:49 1987

As can be seen, every ten to fifteen seconds, some monstrous time adjustment
gets added in by the kernel.  This is *not* being done by ntp or any other time
daemon - it even happens in single user mode.  It can also be seen that after
the date is reset to 1987 (GMT) this behavior disappears, and time stabilizes.
The silly message when booting disappears as well.

So it looks like the guilty party is /sys/sundev/clock.c.  But not having
source code, what can I do?

Other observations: Our Sun-2s (bless their little obsolete cpus) have not even
stuttered since the leap second went down.  Their TOD clock code seems to be
just fine.

So will someone with access to Sun kernel sources please help me out?  This is
a serious bug, and I imagine Sun will have a patched OBJ/clock.o for binary
sites eventually, but in the meantime, it is stretching the resources of ntpd
to even keep the machines within a *minute* or so of true time.  The poor
machines which aren't running ntp are okay until they are rebooted, or someone
foolishly tries to set their time, but once that happens, their watch gears get
unsprung.

@alex
---
arpanet: dupuy@columbia.edu
uucp:	...!seismo!columbia!dupuy

dupuy@westend.columbia.edu (Alexander Dupuy) (01/01/88)

Forgot to give the versions for which this problem exists:

SunOS 3.2
@(#)clock.c 1.1 86/07/07

and

SunOS 3.4
@(#)clock.c 1.2 86/10/08
---
arpanet: dupuy@columbia.edu
uucp:	...!seismo!columbia!dupuy

Mills@UDEL.EDU (01/02/88)

Alex,

The problem is that the leap second occurred in a leap year, an event probably
unanticipated by Sun. Not to worry, it is easily fixed by simply stacking 
another penny on the pendulum, as was done by the Big Ben keepers. You may
have to use the old-fashioned kind before they started adding all that
aluminum goop. Old Pence are the best kind, but they are rather hard to find.

Dave

slevy@UC.MSC.UMN.EDU (Stuart Levy) (01/02/88)

Whew.  I was pinging umd1.umd.edu at leap second time, hoping to catch it in
the act (wonder how many others were doing the same thing?), when suddenly
the time difference started hurtling into outer space.  For a moment I
wondered if Dave Mills had added a leap minute instead of second, but no,
our SUNs had all gone mad.  It was a great relief to hear that someone
else saw the same thing.

I believe I have a fix for this..  Probably the easiest way to
distribute it without annoying SUN too much is as a binary patch.
Say:

# adb -w -k /vmunix /dev/mem
resettodr+0xca?X
	(It should contain 0x536efff4, a subqw #1,a6@(-0xc) instruction.)
	(Change it to NOP's in the /vmunix file with...)
.?W 4e714e71
	(and in the running kernel (this seems to be safe) with...)
./W 4e714e71
$q
#


For those who have source, the relevant module is sun3/clock.c.
The line in resettodr() reading

	t += MONTHSEC(--mon, year);

breaks, since MONTHSEC evaluates the --mon twice in leap years.
It could change to

	mon--;
	t += MONTHSEC(mon, year);

This appears to work on our SUNs running 3.3.


				Stuart Levy, Minn. Supercomputer Center
				slevy@uc.msc.umn.edu

slevy@UC.MSC.UMN.EDU (Stuart Levy) (01/03/88)

I forgot to mention in sending out the SUN kernel binary patch that
it ONLY works in leap years -- if you just apply the patch, it will break
in January 1989.  Probably everybody will have an official SUN fix by then
but you might want to keep a note of what changed, just in case.

					Stuart

kre@munnari.oz (Robert Elz) (01/03/88)

In article <8801020627.AA19752@uc.msc.umn.edu>,
slevy@UC.MSC.UMN.EDU (Stuart Levy) writes:
> I believe I have a fix for this..  Probably the easiest way to
> distribute it without annoying SUN too much is as a binary patch.

Then, in article <8801022203.AA25123@uc.msc.umn.edu>,
slevy@UC.MSC.UMN.EDU (Stuart Levy) writes again:
> I forgot to mention in sending out the SUN kernel binary patch that
> it ONLY works in leap years -- if you just apply the patch, it will break
> in January 1989.

Here's an alternative (binary) patch that will work in both leap years,
and in boring old ordinary years.


# adb -w -k /vmunix /dev/mem
resettodr+0xca?X
	(It should contain 0x536efff4, a subqw #1,a6@(-0xc) instruction.
	If you applied Stuart's patch it will contain 0x4e714e71, 2 nop's
	so put back the subw in both the kernel a.out, and memory)
.?W 536efff4
./W 536efff4
	(next, apply a slightly better fix)
resettodr+0xc0?i
	(it should contain "bnes resettodr+0xca", which we will change to be
	"bnes resettodr+0xce" and avoid the incorrect subw)
.?w 660c
	(now verify that it's correct)
.?i
	(and assuming it is now "bnes resettodr+0xce", change the running kernel)
./w 660c
$q

I can't verify that this actually fixes the reported problem, but it clearly
does fix a bug, and should have the same effect this year as Stuart's fix,
while not hurting next year.  I used SunOS 3.4 to do this.  In case other
versions of SunOS deviate (3.3 is apparently the same), here is the original
section of binary ...

_resettodr+0xa6:                        movw    a6@(-0x10),d0
_resettodr+0xaa:                        moveq   #3,d1
_resettodr+0xac:                        andw    d1,d0
_resettodr+0xae:                        andl    #0xffff,d0
_resettodr+0xb4:                        bnes    _resettodr+0xca
_resettodr+0xb6:                        subqw   #1,a6@(-0xc)
_resettodr+0xba:                        cmpw    #2,a6@(-0xc)
_resettodr+0xc0:                        bnes    _resettodr+0xca	<<<< change this to
_resettodr+0xc2:                        movl    #0x263b80,d0
_resettodr+0xc8:                        bras    _resettodr+0xde
_resettodr+0xca:                        subqw   #1,a6@(-0xc)
_resettodr+0xce:                        moveq   #0,d0		<<<< branch to here
_resettodr+0xd0:                        movw    a6@(-0xc),d0
_resettodr+0xd4:                        lea     _monthsec:l,a0
_resettodr+0xda:                        movl    a0@(-4,d0:l:4),d0
_resettodr+0xde:                        addl    d0,d7

kre

dm@BFLY-VAX.BBN.COM (01/04/88)

Could someone explain to the rest of us why adding a leap-second had
any effect on Sun workstations other than making their clock be off by
one second (more than usual)?  Or is 1988 one bit too much in some
data structure?

slevy@UC.MSC.UMN.EDU (Stuart Levy) (01/04/88)

It wasn't the leap second, just the fact that 1988 is a leap year.
SUN-3's contain a date-and-time clock chip, and there's kernel code
to translate UNIX time <=> the clock chip's format.  In leap years,
the code broke and loaded garbage into the chip.  Later, on reading it
back, it noticed that the chip contained an invalid date.

ks@pur-ee.UUCP (Kirk Smith) (01/04/88)

In article <8801032141.AA15091@ucbvax.Berkeley.EDU> dm@BFLY-VAX.BBN.COM writes:
>
>Could someone explain to the rest of us why adding a leap-second had
>any effect on Sun workstations other than making their clock be off by
>one second (more than usual)?  Or is 1988 one bit too much in some
>data structure?

The Sun3 and the software for its Time of Day chip have been developed since
the last leap year.  If the system clock drifts from the time of day
clock, it is, by default, resynced.  A routine converts the YYMMDD.HHMMSS.THT
from the TOD chip to unix time in seconds from the beginning of time
(00:00 Jan 1 1970 GMT).  This routine did this incorrectly during leap years.
A "--mon" was used in an argument to a MACRO, and in leap years, that argument
got evaluated twice.

Can you say time bomb?

						Kirk Smith

mp@allegra (01/04/88)

If anyone has a patch for sun4's, please send it along; we don't yet
have source code or much knowledge of the assembly language.  A
short-term workaround, for both sun3's and sun4's, is to minimize
reliance on the tod clock: run rdate as soon as possible after booting,
and patch the kernel variable dosynctodr to be 0 so that the unix date
is not periodically copied (well, actually, it's adjtime'd) from the
incorrect info in the todr.  This workaround may result in
the unix date running a bit slow due to missed clock interrupts.

    Mark Plotnick
    Department of Solar Calendars
    allegra!mp

bzs@bu-cs.bu.EDU (Barry Shein) (01/05/88)

Oops, I lost the original note so I sent a reply to unix-wizards.
Anyhow, I found that by adb'ing the kernel (or via sources) and setting
the variable 'dosynctodr' to zero (it's a flag which defaults to one)
this problem seems to be ameliorated. I still am not sure what's
causing it or if it's related to the leap-second or not but sections
of the source which slew the clock seem to always check dosynctodr
first.  It shouldn't affect using adjtime from outside the kernel.

I would consider this an emergency patch until someone from Sun sheds
some light on the issue of what's really going on but the systems I've
applied this patch to have been keeping time fine. I'm sure there's
some negative consequence to this patch (always have to set time on
boot? not sure.) I agree tho, I've had systems getting off by an hour
or more of late.

	-Barry Shein, Boston University

bzs@bu-cs.bu.EDU (Barry Shein) (01/05/88)

Urgh, the nice thing about all these patches is one gets so many to
choose from. Stu Levy's looks better than mine (mine: to turn off
dosynctodr), if you applied mine undo it (trivial) and try his.

I'm copying the note so it nullifies my advice on Unix-wizards also.

	-Barry Shein, Boston University

Date: Sat, 2 Jan 88 00:27:09 CST
From: slevy@uc.msc.umn.edu (Stuart Levy)
To: tcp-ip@sri-nic.arpa, westend!dupuy@columbia.edu
Subject: Re:  WARNING: TOD clock not initialized -- CHECK AND RESET THE DATE!

Whew.  I was pinging umd1.umd.edu at leap second time, hoping to catch it in
the act (wonder how many others were doing the same thing?), when suddenly
the time difference started hurtling into outer space.  For a moment I
wondered if Dave Mills had added a leap minute instead of second, but no,
our SUNs had all gone mad.  It was a great relief to hear that someone
else saw the same thing.

I believe I have a fix for this..  Probably the easiest way to
distribute it without annoying SUN too much is as a binary patch.
Say:

# adb -w -k /vmunix /dev/mem
resettodr+0xca?X
	(It should contain 0x536efff4, a subqw #1,a6@(-0xc) instruction.)
	(Change it to NOP's in the /vmunix file with...)
.?W 4e714e71
	(and in the running kernel (this seems to be safe) with...)
./W 4e714e71
$q
#


For those who have source, the relevant module is sun3/clock.c.
The line in resettodr() reading

	t += MONTHSEC(--mon, year);

breaks, since MONTHSEC evaluates the --mon twice in leap years.
It could change to

	mon--;
	t += MONTHSEC(mon, year);

This appears to work on our SUNs running 3.3.


				Stuart Levy, Minn. Supercomputer Center
				slevy@uc.msc.umn.edu

bzs@bu-cs.bu.EDU (Barry Shein) (01/05/88)

It's starting to look like a leap-year, not a leap-second bug, a macro
expanding to add some value twice instead of once (quick analysis.)

Don't believe everything you read on the net (joke.)

	-B

mark@nova.usc.edu (Mark A. Brown) (01/05/88)

Here's a binary patch for the leap year bug that will work for Sun 4s.

# adb -w -k /vmunix /dev/mem
resettodr+0x110?X
	(It should contain 0xba276001, a sub %i5, 0x1, %i5 instruction.)
	(Change it to a nop in both /vmunix and kernel memory)
.?W 0x1000000
./W 0x1000000
	(To make yourself feel better, do the following)
.?i
./i
	(If they're nop's, things should now be better)
$q

We are running the SYS4 GAMMA release, but things should be the same
for SYS4 3.2.  If not, here's the original GAMMA binary and you can go from
there.

_resettodr+0xf0:                srl     %i5, 0x10, %i5
_resettodr+0xf4:                orcc    %g0, %i1, %g0
_resettodr+0xf8:                bne     _resettodr + 0x120
_resettodr+0xfc:                sub     %i5, 0x1, %i5
_resettodr+0x100:               sll     %i5, 0x10, %i5
_resettodr+0x104:               srl     %i5, 0x10, %i5
_resettodr+0x108:               cmp     %i5, 0x2
_resettodr+0x10c:               bne,a   _resettodr + 0x120
_resettodr+0x110:               sub     %i5, 0x1, %i5	     <<< change to nop
_resettodr+0x114:               sethi   %hi(0x263800), %o5
_resettodr+0x118:               ba      _resettodr + 0x134
_resettodr+0x11c:               add     %o5, 0x380, %i3
_resettodr+0x120:               sll     %i5, 0x10, %i5
_resettodr+0x124:               srl     %i5, 0x10, %i5
_resettodr+0x128:               sub     %i5, 0x1, %i3
_resettodr+0x12c:               sll     %i3, 0x2, %i3
_resettodr+0x130:               ld      [%i3 + %l7], %i3
_resettodr+0x134:               add     %i4, %i3, %i4
_resettodr+0x138:               mov     %i4, %o0



	Mark