tpk@uunet.uu.net (Ted Kyriakakis) (12/03/89)
For the past week we have been having file system corruption problems on our mail gateway system. The system is a 3/180 with 2 Fujitsu 451 disk controllers (the first controller has an Eagle and a CDC 9771 drive, the second controller has a Super Eagle) and a mag tape drive. It has been running SunOS 4.0.3 for several months and supports several diskless clients.

Curiously, the problems only seem to happen over the weekend (the past two). What happens is that the xy0 (Eagle) disk gets munged. The operating system reports generic "I/O errors", we reboot the system, and the subsequent fsck produces DUP block errors on practically every partition on xy0. Occasionally we also get an xy1 "offline" error, but the xy1 file systems are fine. The disk on the second controller never has problems.

When I first looked into the matter, I discovered that vmunix had been modified (corrupted?). I am not sure whether this is just another symptom of the file system damage or whether it may be the cause. We now keep a spare copy of vmunix on another disk and compare the live vmunix against it; a mismatch serves as a flag that the system is having, or is about to have, problems. As long as someone is around, this helps us minimize the file system corruption that occurs. But we still do not know what is causing the problem.

I can think of three possible general causes:

1) Hardware (CPU, xy0 disk, the disk controller, or the disk cables): But no hardware errors are being reported and the disk drive has not been indicating any faults. I would also expect this type of problem to show up more regularly (daily).

2) OS software problem (SunOS 4.0.3 bug): But we were running for about 3 months without any noticeable problems.

3) Virus or outside break-in: But it has not spread to any of our other hosts, which would be quite simple once the mail host was breached.
Or it could be a combination of the above, or one of the above precipitated by some other external factor such as a power surge. As you can probably tell, I am grasping at straws.

If you have experienced similar problems, or know of problems with SunOS 4.0.3 which could be the cause, or know of any viruses or break-ins going on which have exhibited similar symptoms, please let me know. I will summarize the responses to the newsgroup if interest warrants.

In the meantime, we will be waiting and watching to see if we can pinpoint the circumstances surrounding the start of the problem. If that fails to turn up anything, I guess we will start swapping out the hardware and/or going back to the previous version of SunOS.
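
For anyone who wants to set up a similar vmunix check, the comparison against the spare copy could look roughly like the following. This is only a minimal sketch in Python, not the procedure we actually use: the paths /spare/vmunix and /vmunix and the names kernel_changed and main are assumptions made up for illustration. It does a byte-by-byte comparison and treats an unreadable file as a warning too, since I/O errors are one of the symptoms.

```python
import filecmp
import sys


def kernel_changed(reference_path, live_path):
    """Return True if the live file differs from the reference copy."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting size and modification time alone.
    return not filecmp.cmp(reference_path, live_path, shallow=False)


def main(argv):
    # Hypothetical default paths: where the spare copy lives is an assumption.
    reference = argv[1] if len(argv) > 1 else "/spare/vmunix"
    live = argv[2] if len(argv) > 2 else "/vmunix"
    try:
        changed = kernel_changed(reference, live)
    except OSError as err:
        # A missing or unreadable file is itself a warning sign.
        print(f"WARNING: could not compare {live} with {reference}: {err}")
        return 2
    if changed:
        print(f"WARNING: {live} differs from {reference}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Run periodically (from cron, say), a nonzero exit status would be the cue to take the system down before more of xy0 is damaged. Keeping the reference copy on a disk attached to the other controller keeps it away from whatever is affecting xy0.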