[comp.realtime] Time Problem on OS-9 Systems

herbert@cernvax.UUCP (herbert walseth) (07/03/90)

Help!

After more than four months of uninterrupted operation, the real time
clock on our OS-9 systems is suddenly causing us problems.  Below is an
attempt to describe the situation.  We are quite stuck with the problem
and would greatly appreciate any advice from the net.

I am cross-posting this to comp.realtime and comp.os.os9; both general
and OS-9 specific comments are welcome.

[Sorry about the length of the posting, but I'm trying to include
all relevant information and that is not easy when you don't know 
where the problem lies.]

The systems read the time from a real-time clock card when they boot. 
From then it is up to the OS-9 software to keep the time correct. 
This worked fine and the time was stable for months. But during the 
last two weeks, the time on three of the systems has suddenly started 
to slow down. They are all steadily losing approximately ten minutes per
day.

There might be several obvious reasons for this. We did have a similar
problem on one of our systems before and that was due to a hardware 
problem on the cpu card. But we find it most unlikely that three cards 
should break down almost at the same time, more than a year after 
the installation.

The malfunction might also be caused by some problem with the power 
supply; the area went through a period of heavy thunderstorms some 
time ago. But our systems are powered from LARGE batteries that should 
filter out all spikes from the mains. Sudden changes in temperature 
etc. are also quite unlikely 100 meters underground.

The three systems are standing several kilometers away from each other 
and are not directly connected in any way. They have been doing exactly 
the same tasks during this period and no operational changes that 
can explain the situation have been made.

We are not moving our VME crates around at high speeds.

After eliminating all more or less obvious faults that we could think 
of, we went to the less likely ones. The only possibility that we 
have left, and that we would be very interested in getting some feedback 
on, is that the long uninterrupted period of operation may have caused 
an overflow somewhere in the OS-9 software and that this might cause
the system to slow down.

All systems have the same software installed. There are around 25 
processes running concurrently and although they are sleeping most 
of the time, some of them have accumulated 100+ cpu hours and made 
some hundred million calls to the kernel. There are also one or two 
short-lived processes forked every second. This should have accumulated 
to around 15-20 million forks by now. Our installed systems are running
OS-9 Version 2.2.
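
The fork estimate can be sanity-checked with simple arithmetic. The sketch below is only an illustration: it assumes that "more than four months" means about 120 days of uptime, and the names `UPTIME_DAYS` and `forks_accumulated` are made up for the example.

```python
SECONDS_PER_DAY = 86_400
UPTIME_DAYS = 120  # assumption: "more than four months" taken as about 120 days


def forks_accumulated(forks_per_second: int, days: int = UPTIME_DAYS) -> int:
    """Total forks after `days` of uptime at a constant fork rate."""
    return forks_per_second * days * SECONDS_PER_DAY


# The post gives "one or two" short-lived processes per second.
for rate in (1, 2):
    print(f"{rate} fork(s)/s for {UPTIME_DAYS} days: {forks_accumulated(rate):,} forks")
```

Under that assumption the total lies between roughly 10 and 21 million forks, so the quoted 15-20 million is plausible.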

What really made us suspect the software is the following: During 
a shutdown of the accelerator we rebooted one of the systems, used 
setime to correct the time on the second and left the last one untouched. 
After this, the one that was rebooted started to run normally, while 
the two others continued to lose time.

We find it quite hard to debug this problem. Obviously we cannot plug 
in a logic analyser etc. since a reboot will cure (hide) the problem. 
But the systems have a floppy so we can install test programs. I have 
checked system global variables like ticks per second, ticks since 
reboot etc. and they all look OK. 

What really confused me is the following experiment: I made a 
small program that reads the ticks since reboot and seconds left until 
midnight. It then sleeps for 0x144510 ticks, or 3 hours, and calculates 
the number of ticks and seconds that have passed while it was asleep. 
On all systems, the expected 0x144510 ticks have passed, but the number 
of seconds that have passed according to the D_Seconds variable is 
different. The amazing thing is that on the systems where the clock 
is running incorrectly, the elapsed time is 0x2a30 seconds or 3 hours 
as it should be. On the systems where the clock is correct, however, 
0x2a82 seconds have passed!
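
The figures from this experiment can be cross-checked with a few lines of arithmetic. The sketch below is only a reading of the numbers quoted above, not a diagnosis. It assumes that the systems showing the correct time also report the true elapsed seconds, and all names in it are made up for the illustration.

```python
# Cross-check of the figures quoted in the post (names are illustrative).
SLEEP_TICKS = 0x144510        # ticks slept on every system
SECONDS_SLOW_CLOCK = 0x2A30   # D_Seconds delta where the clock loses time
SECONDS_GOOD_CLOCK = 0x2A82   # D_Seconds delta where the clock is correct

# Tick rate implied by "0x144510 ticks, or 3 hours".
nominal_rate = SLEEP_TICKS / SECONDS_SLOW_CLOCK

# Assumption: the systems with the correct clock measured true elapsed time,
# so this is the rate at which ticks really arrive.
implied_true_rate = SLEEP_TICKS / SECONDS_GOOD_CLOCK

# A clock that counts nominal_rate ticks as one second while ticks arrive at
# implied_true_rate falls behind by this fraction of real time.
lag_fraction = 1 - implied_true_rate / nominal_rate
lost_minutes_per_day = lag_fraction * 86_400 / 60

print(f"nominal rate:       {nominal_rate:.3f} ticks/s")
print(f"implied true rate:  {implied_true_rate:.3f} ticks/s")
print(f"lost per day:       {lost_minutes_per_day:.1f} minutes")
```

Under that assumption the nominal rate is exactly 123 ticks per second, the real rate is slightly above 122, and the resulting drift is a little under eleven minutes per day, which is in line with the "approximately ten minutes per day" reported earlier.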

Here is a description of the hardware:
The cpu board has a 68020 cpu, 68881 fpcp, 1 MByte of SRAM and 512 
KByte of EPROM. There is also some RAM and ROM on a real-time clock 
card. (This clock is only used to set the time when the system boots.) 
The systems have a 40 MByte hard disk and a floppy disk drive (Not 
used during normal operation.) The cpu card and disk controller are 
made by PEP Modular Computers. There are some in-house made I/O cards 
and a VME bus status card. The systems are connected to the rest of 
the world through a Mil-1553B line. 

Any of the components might of course break down and cause problems,
but why should it happen on three systems at the same time? And how 
could a reboot solve this problem?

The other systems that do not have the same problems are both less 
heavily loaded and have been running continuously for a shorter time 
period.


This is the best description of our problem that I am able to come up
with.  We would be most thankful if you have any ideas about what we could
look for (and how to look without rebooting).  We would also like to hear
from other groups who have had their OS-9 systems running for a similar
period and who have/have not seen the same problem.

Thank you in advance.

--
	     Herbert Walseth, herbert@cernvax.cern.ch
	     TIS Division, CERN, CH-1211 Geneva, Switzerland
	     Phone: +41 22 767 2634, Fax +41 22 785 2208

   No problem is so big or so complicated that it can't be run away from

herbert@cernvax.UUCP (herbert walseth) (07/05/90)

>In article <2031@cernvax.UUCP> I described a problem with the real time
>clock on our OS-9 systems. After months of operation, the clock
>suddenly started to slow down.

To all of you who have spent the last nights awake trying to figure out
what was happening: You may now start to sleep again, the problem has been
solved!

The cpu board manufacturer finally took us seriously and I just had a fax
back where they admit a "small bug in our clock driver".  It will overflow
after around four months of operation and the clock will then no longer be
correct.

I am at any time ready to discuss the size of this bug with them, but
that is another story. Hopefully they will have an upgrade ready soon.

This bug should concern all users of the VMPM 68KC and VMPM 68KC-2
cpu boards from PEP Modular Computers. If you intend to have your
system continuously operational for more than four months, get hold
of the new clock driver first!

+----------------------------+-------------------------------------------+
|                            |                                           |
|  Herbert Walseth           |   No problem is so big or so complicated  |
|  herbert@cernvax.cern.ch   |   that it can't be run away from.         |
|                            |                                           |
+----------------------------+-------------------------------------------+

knudsen@cbnewsd.att.com (michael.j.knudsen) (07/06/90)

In article <2042@cernvax.UUCP>, herbert@cernvax.UUCP (herbert walseth) writes:
>  If you intend to have your
> system continuously operational for more than four months, get hold
> of the new clock driver first!

A bug that bites only after 4 months of continuous operation
without a reboot?

Says something about OS9/K that such a problem should ever come up.
Also about the hardware used (except for the clock bug).
I wonder how many other OSes stay up that long...?
-- 

"Round and round the while() loop goes;
        Whether it stops," Turing says, "no one knows."

mcculley@alien.enet.dec.com (07/13/90)

In article <1990Jul5.190532.17243@cbnewsd.att.com>, knudsen@cbnewsd.att.com (michael.j.knudsen) writes...
> 
>A bug that bites only after 4 months of continuous operation without a reboot?
> [...]
>Says something about OS9/K that such a problem should ever come up.
>Also about the hardware used (except for the clock bug).
>I wonder how many other OSes stay up that long...?
>-- 

Guess it says something about various expectation settings that I'm more
surprised that four months is long enough to be noteworthy than I am that
hardware or software stayed up that long.  

I *-> expect <-* production systems (hardware and software) to be capable of
staying up indefinitely, unless I do something to cause them to be otherwise.

Why would anyone expect otherwise?

- Bruce McCulley
  Digital Equipment Corp.
  Corporate Software Engineering
  Nashua, NH USA

herbert@cernvax.UUCP (herbert walseth) (07/13/90)

In article <13391@shlump.nac.dec.com> mcculley@alien.enet.dec.com writes:
>
>In article <1990Jul5.190532.17243@cbnewsd.att.com>, knudsen@cbnewsd.att.com (michael.j.knudsen) writes...
>> 
>>A bug that bites only after 4 months of continuous operation without a reboot?
>> [...]
>>Says something about OS9/K that such a problem should ever come up.
>>Also about the hardware used (except for the clock bug).
>>I wonder how many other OSes stay up that long...?
>>-- 
>
> [...]
>
>I *-> expect <-* production systems (hardware and software) to be capable of
>staying up indefinitely, unless I do something to cause them to be otherwise.
>
>Why would anyone expect otherwise?
>

IMHO, both yes and no.

Hardware does have a limited lifetime and that is something we've got to
live with.  In particular, hard disks and other devices with moving parts
are bound to break down sooner or later.

Some might say that an MTBF of 40,000 hours is as good as infinity.  But if
you have 100 systems continuously running you must be ready to replace a
disk every fortnight.
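
As a rough check of that figure, the sketch below assumes that disk failures are independent and occur at a constant rate, so that the fleet-wide failure rate is simply the sum of the per-unit rates. The variable names are illustrative only.

```python
MTBF_HOURS = 40_000   # per-disk MTBF quoted above
SYSTEMS = 100         # number of continuously running systems quoted above

# Assumption: independent failures at a constant rate, so the expected time
# between failures anywhere in the fleet is the per-unit MTBF divided by the
# number of units.
hours_between_failures = MTBF_HOURS / SYSTEMS
days_between_failures = hours_between_failures / 24

print(f"one disk failure roughly every {hours_between_failures:.0f} hours "
      f"({days_between_failures:.1f} days)")
```

That works out to one failure about every 400 hours, or a bit under 17 days, which is the same order as a fortnight.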

On the other hand, I do not _expect_ software to have a _limited lifetime_.
In particular real-time software, both OS and applications, that are
supposed to run continuously should be able to do this.  There should not
be a slowly incrementing counter, a slow eating of memory or other time
bombs hidden in the system that sooner or later will cause it to crash.
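
One way to look for such a counter during review is to estimate how long it takes to wrap at its expected increment rate. The sketch below is a generic illustration and does not attempt to reconstruct the clock driver bug, whose details were not given. The helper `days_until_overflow` is hypothetical, and the rate of 123 per second is taken from "0x144510 ticks, or 3 hours" in the first post.

```python
def days_until_overflow(bits: int, increments_per_second: float) -> float:
    """Days until an unsigned counter of `bits` width wraps around,
    starting from zero and incremented at a constant rate."""
    return (2 ** bits) / increments_per_second / 86_400


# Example widths and rates; 1/s stands for a seconds counter and
# 123/s for a tick counter on the systems described in this thread.
for bits in (16, 24, 31, 32):
    for rate in (1, 123):
        days = days_until_overflow(bits, rate)
        print(f"{bits:2d}-bit counter at {rate:3d}/s wraps after {days:12.2f} days")
```

Any counter whose wrap time is shorter than the intended uptime of the system deserves a closer look at what happens when it wraps.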

This is not acceptable no matter how unlikely the software designer thinks
it is that the limit will ever be reached.  Sooner or later, someone is
going to reach it.  And with real-time systems, the result might be
disastrous.

It is of course very hard to verify that a system is free of time bombs.
(Impossible, one might say.) By pushing the system hard in the lab, months
of normal operation may be simulated over a weekend or so.  But not
everything is equally easy to simulate.  This thread was started by a
posting where I asked for help with a clock problem <2031@cernvax.UUCP>.
It turned out that due to a bug in the clock driver, the clock would
start to slow down after around four months of continuous operation.  It
is not obvious to me how a user can protect himself against problems like
this, and this particular bug taught me a lesson.

PS. Sorry if I post this twice, something funny happened the first time
I tried. (And the system has only been up for two days :-).)

+----------------------------+-------------------------------------------+
|                            |                                           |
|  Herbert Walseth           |   No problem is so big or so complicated  |
|  herbert@cernvax.cern.ch   |   that it can't be run away from.         |
|                            |                                           |
+----------------------------+-------------------------------------------+