herbert@cernvax.UUCP (herbert walseth) (07/03/90)
Help! After more than four months of uninterrupted operation, the real-time clock on our OS-9 systems is suddenly causing us problems. Below is an attempt to describe the situation. We are quite stuck with the problem and would highly appreciate all kinds of advice from the net. I am cross-posting this to comp.realtime and comp.os.os9; both general and OS-9 specific comments are welcome. [Sorry about the length of the posting, but I'm trying to include all relevant information and that is not easy when you don't know where the problem lies.]

The systems read the time from a real-time clock card when they boot. From then on it is up to the OS-9 software to keep the time correct. This worked fine and the time was stable for months. But during the last two weeks, the time on three of the systems has suddenly started to slow down. They are all steadily losing approximately ten minutes per day.

There might be several obvious reasons for this. We did have a similar problem on one of our systems before, and that was due to a hardware problem on the cpu card. But we find it most unlikely that three cards should break down almost at the same time, more than a year after the installation.

The malfunction might also be caused by some problem with the power supply; the area went through a period of heavy thunderstorms some time ago. But our systems are powered from LARGE batteries that should filter out all spikes from the mains. Sudden changes in temperature etc. are also quite unlikely 100 meters under ground.

The three systems are standing several kilometers away from each other and are not directly connected in any way. They have been doing exactly the same tasks during this period and no operational changes that can explain the situation have been made. We are not moving our VME crates around at high speeds.

After eliminating all more or less obvious faults that we could think of, we went to the less likely ones.
The only possibility that we have left, and that we would be very interested in getting some feedback on, is that the long uninterrupted period of operation may have caused an overflow somewhere in the OS-9 software, and that this might cause the system clock to slow down.

All systems have the same software installed. There are around 25 processes running concurrently and although they are sleeping most of the time, some of them have accumulated 100+ cpu hours and made some hundred million calls to the kernel. There are also one or two short-lived processes forked every second. This should have accumulated to around 15-20 million forks by now. Our installed systems are running OS-9 Version 2.2.

What really made us suspect the software is the following: During a shutdown of the accelerator we rebooted one of the systems, used setime to correct the time on the second and left the last one untouched. After this, the one that was rebooted started to run normally, while the two others continued to lose time.

We find it quite hard to debug this problem. Obviously we cannot plug in a logic analyser etc. since a reboot will cure (hide) the problem. But the systems have a floppy so we can install test programs. I have checked system global variables like ticks per second, ticks since reboot etc. and they all look OK.

What really confused me is the following experience: I made a small program that reads the ticks since reboot and seconds left until midnight. It then sleeps for 0x144510 ticks, or 3 hours, and calculates the number of ticks and seconds that have passed while it was asleep. On all systems, the expected 0x144510 ticks have passed, but the number of seconds that have passed according to the D_Seconds variable is different. The amazing thing is that on the systems where the clock is running incorrectly, the elapsed time is 0x2a30 seconds or 3 hours as it should be. On the systems where the clock is correct, however, 0x2a82 seconds have passed!
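The hexadecimal figures quoted above can be checked with a little arithmetic. The following is only a sketch of that check, using the three numbers from the post; the variable and function names are made up for illustration, and the interpretation in the comments is one possible reading of the measurements, not a diagnosis of the driver.

```python
# Figures as reported in the post (hexadecimal).
SLEEP_TICKS = 0x144510       # ticks the test program slept for
SECONDS_SLOW = 0x2A30        # seconds counted on the systems that lose time
SECONDS_GOOD = 0x2A82        # seconds counted on the systems that keep time

SECONDS_PER_DAY = 24 * 60 * 60


def ticks_per_second(ticks, seconds):
    """Average number of ticks the kernel consumed per counted second."""
    return ticks / seconds


# Ticks per counted second on the slow and on the correct systems.
nominal_rate = ticks_per_second(SLEEP_TICKS, SECONDS_SLOW)
effective_rate = ticks_per_second(SLEEP_TICKS, SECONDS_GOOD)

# Seconds by which the two counts differ, scaled from 3 hours to 24 hours.
drift_seconds_per_day = (SECONDS_GOOD - SECONDS_SLOW) * SECONDS_PER_DAY / SECONDS_SLOW

print(f"slow systems:    {nominal_rate:.3f} ticks per counted second")
print(f"correct systems: {effective_rate:.3f} ticks per counted second")
print(f"difference:      {drift_seconds_per_day:.0f} s/day "
      f"({drift_seconds_per_day / 60:.1f} min/day)")
```

If the same number of ticks is taken to span the same real interval on every system, the slow systems count exactly 123 ticks per second while the correct ones average slightly fewer, and the resulting difference of roughly eleven minutes per day is in line with the "approximately ten minutes per day" reported earlier.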
Here is a description of the hardware: The cpu board has a 68020 cpu, 68881 fpcp, 1 MByte of SRAM and 512 KByte of EPROM. There is also some RAM and ROM on a real-time clock card. (This clock is only used to set the time when the system boots.) The systems have a 40 MByte hard disk and a floppy disk drive. (Not used during normal operation.) The cpu card and disk controller are made by PEP Modular Computers. There are some in-house made I/O cards and a VME bus status card. The systems are connected to the rest of the world through a Mil-1553B line.

Any of the components might of course break down and cause problems, but why should it happen on three systems at the same time? And how could a reboot solve this problem? The other systems that do not have the same problems are both less heavily loaded and have been running continuously for a shorter time period.

This is the best description of our problem that I am able to come up with. We would be most thankful if you have any ideas about what we could look for (and how to look without rebooting). We would also like to hear from other groups who have had their OS-9 systems running for a similar period and who have/have not seen the same problem.

Thank you in advance.

--
Herbert Walseth, herbert@cernvax.cern.ch
TIS Division, CERN, CH-1211 Geneva, Switzerland
Phone: +41 22 767 2634, Fax +41 22 785 2208
No problem is so big or so complicated that it can't be run away from
herbert@cernvax.UUCP (herbert walseth) (07/05/90)
>In article <2031@cernvax.UUCP> I described a problem with the real time
>clock on our OS-9 systems. After months of operation, the clock
>suddenly started to slow down.

To all of you who have spent the last nights awake trying to figure out what was happening: You may now start to sleep again, the problem has been solved!

The cpu board manufacturer finally took us seriously and I just had a fax back where they admit a "small bug in our clock driver". It will overflow after around four months of operation and the clock will then no longer be correct. I am at any time ready to discuss the size of this bug with them, but that is another story. Hopefully they will have an upgrade ready soon.

This bug should concern all users of the VMPM 68KC and VMPM 68KC-2 cpu boards from PEP Modular Computers. If you intend to have your system continuously operational for more than four months, get hold of the new clock driver first!

+----------------------------+-------------------------------------------+
|                            |                                           |
| Herbert Walseth            | No problem is so big or so complicated    |
| herbert@cernvax.cern.ch    | that it can't be run away from.           |
|                            |                                           |
+----------------------------+-------------------------------------------+
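The fax as reported does not say which quantity overflows or how wide it is. As a rough sketch only, the time a free-running counter takes to wrap can be estimated from its width and increment rate. The counter widths below are illustrative assumptions, not details of the PEP driver, and the rate of 123 per second is simply what the earlier figure of 0x144510 ticks for a nominal 3 hours works out to.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits, increments_per_second):
    """Days until a counter with `bits` usable bits wraps around,
    assuming it starts at zero and advances at a constant rate."""
    return (2 ** bits) / increments_per_second / SECONDS_PER_DAY


# Hypothetical counter widths; 31 stands for a signed 32-bit value.
for bits in (24, 31, 32):
    print(f"{bits}-bit counter at 123/s wraps after "
          f"{days_until_wrap(bits, 123):.1f} days")
```

Under these assumptions a plain 32-bit tick counter would last well over a year, and a signed one more than six months, so the quantity that overflows after about four months is presumably something else, for example a value that advances faster than once per tick.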
knudsen@cbnewsd.att.com (michael.j.knudsen) (07/06/90)
In article <2042@cernvax.UUCP>, herbert@cernvax.UUCP (herbert walseth) writes:
> If you intend to have your
> system continuously operational for more than four months, get hold
> of the new clock driver first!

A bug that bites only after 4 months of continuous operation without a reboot?

Says something about OS9/K that such a problem should ever come up.
Also about the hardware used (except for the clock bug).
I wonder how many other OSes stay up that long...?
--
"Round and round the while() loop goes;
Whether it stops," Turing says, "no one knows."
mcculley@alien.enet.dec.com (07/13/90)
In article <1990Jul5.190532.17243@cbnewsd.att.com>, knudsen@cbnewsd.att.com (michael.j.knudsen) writes...
>
>A bug that bites only after 4 months of continuous operation without a reboot?
> [...]
>Says something about OS9/K that such a problem should ever come up.
>Also about the hardware used (except for the clock bug).
>I wonder how many other OSes stay up that long...?
>--

Guess it says something about various expectation settings that I'm more surprised that four months is long enough to be noteworthy than I am that hardware or software stayed up that long.

I *-> expect <-* production systems (hardware and software) to be capable of staying up indefinitely, unless I do something to cause them to be otherwise.

Why would anyone expect otherwise?

- Bruce McCulley
  Digital Equipment Corp.
  Corporate Software Engineering
  Nashua, NH USA
herbert@cernvax.UUCP (herbert walseth) (07/13/90)
In article <13391@shlump.nac.dec.com> mcculley@alien.enet.dec.com writes:
>
>In article <1990Jul5.190532.17243@cbnewsd.att.com>, knudsen@cbnewsd.att.com (michael.j.knudsen) writes...
>>
>>A bug that bites only after 4 months of continuous operation without a reboot?
>> [...]
>>Says something about OS9/K that such a problem should ever come up.
>>Also about the hardware used (except for the clock bug).
>>I wonder how many other OSes stay up that long...?
>>--
>
> [...]
>
>I *-> expect <-* production systems (hardware and software) to be capable of
>staying up indefinitely, unless I do something to cause them to be otherwise.
>
>Why would anyone expect otherwise?
>

IMHO, both yes and no.

Hardware does have a limited lifetime and that is something we've got to live with. In particular, hard disks and other devices with moving parts are bound to break down sooner or later. Some might say that an MTBF of 40,000 hours is as good as infinity. But if you have 100 systems continuously running you must be ready to replace a disk every fortnight.

On the other hand, I do not _expect_ software to have a _limited lifetime_. In particular, real-time software, both OS and applications, that is supposed to run continuously should be able to do so. There should not be a slowly incrementing counter, a slow eating of memory or other time bombs hidden in the system that sooner or later will cause it to crash. This is not acceptable no matter how unlikely the software designer thinks it is that the limit will ever be reached. Sooner or later, someone is going to reach it. And with real-time systems, the result might be disastrous.

It is of course very hard to verify that a system is free of time bombs. (Impossible, one might say.) By pushing the system hard in the lab, months of normal operation may be simulated over a weekend or so. But not everything is equally easy to simulate.

This thread was started by a posting where I asked for help with a clock problem <2031@cernvax.UUCP>.
It turned out that, due to a bug in the clock driver, the clock would start to slow down after around four months of continuous operation. It is not obvious to me how a user can protect himself against problems like this, and this particular bug taught me a lesson.

PS. Sorry if I post this twice, something funny happened the first time I tried. (And the system has only been up for two days :-).)

+----------------------------+-------------------------------------------+
|                            |                                           |
| Herbert Walseth            | No problem is so big or so complicated    |
| herbert@cernvax.cern.ch    | that it can't be run away from.           |
|                            |                                           |
+----------------------------+-------------------------------------------+
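The "disk every fortnight" estimate earlier in this post can be reproduced with a one-line calculation. This is a minimal sketch that assumes one disk per system, independent failures and a constant failure rate; the function name is made up for illustration.

```python
HOURS_PER_DAY = 24


def days_between_failures(mtbf_hours, units):
    """Expected days between failures somewhere in a fleet of `units`
    identical devices, each with the given MTBF (constant failure rate)."""
    return mtbf_hours / units / HOURS_PER_DAY


# 100 continuously running systems, each with a 40,000-hour MTBF disk.
interval = days_between_failures(40_000, 100)
print(f"about one failure every {interval:.1f} days")
```

Under those assumptions the fleet sees a failure roughly every two and a half weeks, which is close to the fortnight quoted.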