[comp.sys.sun] 4/330 time problems

sam@ics.uci.edu (Sam Horrocks) (11/09/89)

Our SPARCserver 4/330s are having trouble keeping the correct time.
About once a week, they fall about 45 minutes behind.  Sun gave me the
following patch to get rid of the problem.  I just installed the patch
today so I can't say that it has fixed the problem, but it has not caused
the system to do anything terrible, so if you have the same problem it
couldn't hurt to try this.

I don't claim to know what this patch does.  Use it at your own risk.  If
anyone can offer other solutions, please send them to me at
sam@ics.uci.edu

Here's Sun's patch:

Topic: Time of Day clock

Symptom:
Loss of the correct Time of Day over a period of time

Problem Description:
There is a bug in SunOS 4.0.3 which causes the Sun 4300 processor
board to be unable to synchronize the kernel's notion of the time
of day with the TOD chip.

Corrective Action:
To patch a running kernel: (will be fixed until next reboot)

        # adb -w -k /vmunix /dev/mem
        todget+0x1e0/X
<old value should be 80a3e006>
        todget+0x1e0/W 80a3e007
        $q
        #

To patch the kernel on disk:

        # adb -w /vmunix
        todget+0x1e0?X
<old value should be 80a3e006>
        todget+0x1e0?W 80a3e007
        $q
        #

To patch future kernels:

        # adb -w /sys/sun4/OBJ/clock.o
        todget+0x1e0?X
<old value should be 80a3e006>
        todget+0x1e0?W 80a3e007
        $q
        #
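
For readers wondering what the one-word change amounts to, here is a
minimal sketch.  It assumes the word at todget+0x1e0 is an ordinary SPARC
V8 format-3 instruction with an immediate operand; the patch text does not
say so.  Under that assumption both values decode as a compare against a
small constant, and the patch raises that constant from 6 to 7.  Why the
constant matters is not something the decode can tell you.  The helper
names are illustrative and are not part of any Sun tool.

```python
# Field layout follows the SPARC V8 format-3 encoding (op=2, immediate form).
# Assumption: the patched word is an instruction, not data.

OP3_NAMES = {0x14: "subcc"}  # only the opcode needed for this example


def reg_name(n):
    """Map a 5-bit register number to its name (%g, %o, %l, %i groups)."""
    return "%" + "goli"[n >> 3] + str(n & 7)


def decode_format3(word):
    """Split a 32-bit word into SPARC format-3 fields."""
    simm13 = word & 0x1FFF
    if simm13 & 0x1000:          # sign-extend the 13-bit immediate
        simm13 -= 0x2000
    return {
        "op": (word >> 30) & 0x3,
        "rd": (word >> 25) & 0x1F,
        "op3": (word >> 19) & 0x3F,
        "rs1": (word >> 14) & 0x1F,
        "i": (word >> 13) & 0x1,
        "simm13": simm13,
    }


def describe(word):
    """Render the word as assembly, or say why it cannot be rendered."""
    f = decode_format3(word)
    if f["op"] != 2 or f["i"] != 1:
        return "not a format-3 immediate instruction"
    name = OP3_NAMES.get(f["op3"], "op3=0x%02x" % f["op3"])
    # subcc with %g0 as destination is the synthetic "cmp" instruction
    if name == "subcc" and f["rd"] == 0:
        return "cmp %s, %d" % (reg_name(f["rs1"]), f["simm13"])
    return "%s %s, %d, %s" % (
        name, reg_name(f["rs1"]), f["simm13"], reg_name(f["rd"]))


for w in (0x80A3E006, 0x80A3E007):
    print("%08x  %s" % (w, describe(w)))
```

Run as is, this prints "cmp %o7, 6" for the old word and "cmp %o7, 7" for
the new one.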

Sam

dupuy@cs.columbia.edu (11/14/89)

> Problem Description:
> There is a bug in SunOS 4.0.3 which causes the Sun 4300 processor
> board to be unable to synchronize the kernel's notion of the time
> of day with the TOD chip.
> 
> Corrective Action:
> To patch a running kernel: (will be fixed until next reboot)
> 
> 	   # adb -w -k /vmunix /dev/mem
> 	   todget+0x1e0/X
> <old value should be 80a3e006>
> 	   todget+0x1e0/W 80a3e007
> 	   $q
> 	   #

Just as a note, the clock.o module that comes with the SPARCserver 390
"feature" tape (i.e. support for IPI disks/controllers) already has this
patch in it.  It certainly seems to be able to set the TOD chip, in fact
it tells us every time it does so! :-(

Nov 12 23:54:11 hudson vmunix: resettodr: setting TOD chip to 12-13-21 04:54:11
Nov 12 23:54:11 hudson vmunix: resettodr: TOD chip was 12-13-21 04:54:11
Nov 12 23:54:23 hudson vmunix: resettodr: setting TOD chip to 12-13-21 04:54:23
Nov 12 23:54:23 hudson vmunix: resettodr: TOD chip was 12-13-21 04:54:23
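
To see how often the kernel is rewriting the chip, something like the
following sketch could be run over the syslog output.  It only looks at
the "setting TOD chip" lines shown above; the year is an assumption (syslog
lines carry none, so 1989 is taken from the posting date), and the function
names are made up for illustration.

```python
import re
from datetime import datetime

# The four lines quoted above, verbatim.
LOG = """\
Nov 12 23:54:11 hudson vmunix: resettodr: setting TOD chip to 12-13-21 04:54:11
Nov 12 23:54:11 hudson vmunix: resettodr: TOD chip was 12-13-21 04:54:11
Nov 12 23:54:23 hudson vmunix: resettodr: setting TOD chip to 12-13-21 04:54:23
Nov 12 23:54:23 hudson vmunix: resettodr: TOD chip was 12-13-21 04:54:23
"""

# Capture the syslog timestamp of each "setting TOD chip" message.
PATTERN = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ vmunix: resettodr: setting TOD chip to")


def set_times(text, year=1989):
    """Return the syslog timestamps at which resettodr set the TOD chip."""
    times = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            stamp = "%d %s" % (year, m.group(1))
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def intervals(times):
    """Seconds between consecutive TOD-chip updates."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


times = set_times(LOG)
print(len(times), intervals(times))
```

On the excerpt above this reports two updates 12 seconds apart.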

We run NTP, and it seems only to be able to keep the clock within about
100 ms of our local secondary time servers.  So there's still some bad
stuff happening with them thar clocks.

inet: dupuy@cs.columbia.edu
uucp: ...!rutgers!cs.columbia.edu!dupuy

ksp@maxwell.nde.swri.edu (Keith S. Pickens) (11/14/89)

On Nov 13, 11:45am, dupuy@cs.columbia.edu wrote:

= We run NTP, and it seems only to be able to keep the clock within about
= 100 ms of our local secondary time servers.  So there's still some bad
= stuff happening with them thar clocks.

I have a 4/370 (same cpu as a 4/330) which has a discontinuous clock.  I
have watched the offset of the clock vs. a stable reference with ntp.  I
see a slow drift, which is to be expected, with one second jumps
superimposed on it.  These jumps occur about every 225-235 minutes.  I
have checked another machine (4/330) and observed the same `effect'.
The net result is that the system time is badly broken.  You can't
stabilize it with ntp and I would expect that it will break things in weird
and wondrous ways.
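
The pattern described above (slow drift with one-second steps on top) can
be picked out of a list of offset samples with a simple difference test.
This is a minimal sketch on made-up data, not the measurements themselves:
the one-sample-per-minute rate, the drift rate, and the 230-minute spacing
(chosen from the 225-235 minute range) are all assumptions, as are the
function names.

```python
def find_jumps(samples, threshold=0.5):
    """Return (time, step) pairs where the offset moves by more than
    `threshold` seconds between two consecutive samples."""
    jumps = []
    for (t0, off0), (t1, off1) in zip(samples, samples[1:]):
        step = off1 - off0
        if abs(step) > threshold:
            jumps.append((t1, step))
    return jumps


def synthetic_offsets(minutes, drift_per_min=0.001, jump_every=230,
                      jump_size=1.0):
    """Hypothetical data: linear drift plus a step every `jump_every`
    minutes, sampled once per minute."""
    return [(m, drift_per_min * m + jump_size * (m // jump_every))
            for m in range(minutes)]


samples = synthetic_offsets(1000)
jumps = find_jumps(samples)
print([t for t, _ in jumps])
```

With the synthetic series this prints [230, 460, 690, 920].  A threshold
well below one second but well above the per-sample drift separates the
steps from the drift; with real ntp offsets the threshold would have to be
chosen against the observed noise.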

Here is the test setup:

                   4/370
                  /
                 /
      time server
       Sun 3/180
                \
                 \
                  4/280

The same measurement was run on both the 4/280 (SunOS 4.0.1) and on the
4/370 (SunOS 4.0.3).  The 4/280 shows only a slow drift relative to the
reference system.  The 4/370 shows a slow drift and 1 second jumps
relative to the reference system.  This indicates that the problem is in
the 4/370 clock and not in the time server.

The same source code was used to build ntp on both the 4/370 and 4/280.

This has been pending with Sun software support for over a month.  Here is
the feedback I got a couple of weeks ago:

From Sun (31 Oct 89):

= This problem has been assigned bugid number 1029022. Engineering is looking
= at the problem as a possible problem with the Sun-4/330 software and/or
= hardware. At this time, I can give you no other status information nor
= schedule for the resolution of the problem.

	-keith
	 ksp@maxwell.nde.swri.edu
	 maxwell!ksp

PS: I have some nice data on this problem.  If anyone wants it send me mail.