W0L@PSUVM.BITNET (Bill Lasher) (11/10/89)
Some of you may have been following the fsck question I posted last week. Thanks to help from several of you, including some people at SGI, I finally decided the REAL problem was our system administration. One of the people at SGI thought the following might be of interest to others, and suggested I post it. The original note follows:

=========================================================================
Date: 9 November 1989, 14:16:04 EST
From: Bill Lasher (814) 898-6391 W0L at PSUVM
Subject: Re: fsck, init state 3
To: dunlap at sgi.sgi.com
In-Reply-To: dunlap%bigboote.csd AT sgi.com -- Thu, 9 Nov 89 11:06:55 PST

Our most recent problem (the RPC timeout) I think was caused by the way we implemented the nightly reboot. We scheduled them 5 minutes apart, figuring that would be enough time. I found out today that one machine was still in the process of restarting when the YP server he was communicating with started to reboot. This caused the system to hang. Rebooting did in fact clear things up, but it took some time. Part of the problem is that the time on each machine is not exactly the same (a difference of a couple of minutes). We are going to set all machines to the same time, and change the reboot interval to 10 minutes.

I think we got thrown off the track because running fsck nightly changed the total time it took for the systems to reboot, and things just happened to work out O.K. Also, we probably weren't patient enough earlier to let reboot do its thing; when reboot didn't work, we tried fsck, which did work because it took longer to finish up, and by the time it was done the network wasn't as busy (or something like that). I think we were also in a hurry to get things fixed, and as a result got sloppy (i.e., running fsck without unmounting, etc.). Some of our problems may come back, but we will handle each of them separately as they occur, and try to be more careful.
I suspect some of the earlier problems (the full disks, hung spool queues) showed up because we were letting the systems run for a week at a time without rebooting, and things just got a little messy. We had planned from the beginning to have them reboot every night, but we had too many other things going on to get it implemented. We'll just take it from here and see what happens.

Best regards,
Bill
========================================================================
END OF ORIGINAL NOTE

You may not follow all the details, but you probably get the general idea. I think it's a good example of what can happen when an experienced computer user gets his first UNIX/networked system.

Bill

"If I knew what I was doing, I wouldn't have had to ask the question!"
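
The interaction between a 5-minute stagger and a couple of minutes of clock skew can be checked with a little arithmetic. The sketch below is only an illustration: the host names, the 4-minute reboot duration, and the 2-minute clock error are hypothetical values, not measurements from the machines described above.

```python
from dataclasses import dataclass


@dataclass
class Reboot:
    host: str
    scheduled_min: float    # scheduled time in minutes, read off the host's own clock
    clock_error_min: float  # host clock minus true time, in minutes (hypothetical)
    duration_min: float     # how long the reboot takes, in minutes (hypothetical)

    def window(self):
        # A host whose clock runs slow starts its reboot late in true time.
        start = self.scheduled_min - self.clock_error_min
        return start, start + self.duration_min


def overlaps(a, b):
    """True if the two reboots are in progress at the same true time."""
    a_start, a_end = a.window()
    b_start, b_end = b.window()
    return a_start < b_end and b_start < a_end


# 5 minutes apart, client clock 2 minutes slow: the windows collide.
client = Reboot("client", scheduled_min=120, clock_error_min=-2, duration_min=4)
server = Reboot("yp-server", scheduled_min=125, clock_error_min=0, duration_min=4)
print(overlaps(client, server))

# 10 minutes apart with the clocks set to the same time: no collision.
client_fixed = Reboot("client", scheduled_min=120, clock_error_min=0, duration_min=4)
server_fixed = Reboot("yp-server", scheduled_min=130, clock_error_min=0, duration_min=4)
print(overlaps(client_fixed, server_fixed))
```

Under these assumed numbers the first pair overlaps and the second does not, which matches the reasoning behind setting the clocks and widening the interval.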
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (11/10/89)
In article <89313.153421W0L@PSUVM.BITNET>, W0L@PSUVM.BITNET (Bill Lasher) writes:
> ... Part of
> the problem is that the time on each machine is not exactly the same (a
> difference of a couple of minutes). We are going to set all machines to
> the same time...

Timed(1m) or timeslave(1m) can be used to synchronize time on the machines. In a simple network (i.e. without gateways separating machines), just turning it on should be enough to keep time synchronized to within 75 msec. The next release after 3.2 will have improvements to the daemons and the kernel which should allow timed or timeslave to keep time 10 or more times tighter.

Time-skew of seconds, not to mention minutes, is a royal pain for make(1) combined with NFS.

It should be noted that the 4.3 BSD release of timed left something to be desired in both implementation and protocol. The implementation in the newer "network" release may be better, although the protocol, having been released, is unfixable. The IRIX version is heavily hacked from the 4.3 release, and seems to be serviceable. It's been about a year since time went crazy among the thousands of IRIS's in the SGI network for reasons other than operator error.

Vernon Schryver
Silicon Graphics
vjs@sgi.com
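
To illustrate why skew hurts make(1) over NFS, here is a minimal sketch. It assumes a simplified model in which the build target lives on an NFS server and is stamped by the server's clock, while the source file is edited on the client's local disk. The 2-minute clock error and the timestamps are hypothetical.

```python
def make_thinks_stale(source_mtime, target_mtime):
    # make(1) rebuilds a target only when a prerequisite carries a newer timestamp.
    return source_mtime > target_mtime


def stamp(true_time, clock_error):
    # The timestamp a machine records: true time plus that machine's clock error.
    return true_time + clock_error


SERVER_ERROR = 120.0  # NFS server clock is 2 minutes fast (hypothetical)
CLIENT_ERROR = 0.0    # client clock is correct (hypothetical)

# True time 1000 s: the target is written to the server and stamped by its clock.
target_mtime = stamp(1000.0, SERVER_ERROR)
# True time 1060 s: the source is edited on the client's local disk.
source_mtime = stamp(1060.0, CLIENT_ERROR)

# The source really is newer, but the skewed timestamps hide that from make.
skewed_decision = make_thinks_stale(source_mtime, target_mtime)
# With synchronized clocks the same edit triggers a rebuild.
synced_decision = make_thinks_stale(stamp(1060.0, 0.0), stamp(1000.0, 0.0))
print(skewed_decision, synced_decision)
```

In this model the skewed case skips a rebuild that is actually needed; skew in the other direction would cause needless rebuilds instead.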
ciemo@bananaPC.sgi.com (Dave Ciemiewicz) (11/10/89)
In article <89313.153421W0L@PSUVM.BITNET>, W0L@PSUVM.BITNET (Bill Lasher) writes:
> Our most recent problem (the RPC timeout) I think was caused by the way
> we implemented the nightly reboot. We scheduled them 5 minutes apart,
> figuring that would be enough time. I found out today that one machine
> was still in the process of restarting when the YP server he was
> communicating with started to reboot. This caused the system to hang.
> Rebooting did in fact clear things up, but it took some time. Part of
> the problem is that the time on each machine is not exactly the same (a
> difference of a couple of minutes). We are going to set all machines to
> the same time, and change the reboot interval to 10 minutes.
>

You might consider running timed on all of your machines which have it. Timed will average the network time of all other machines on your local area network which are running timed. Timed will then make incremental time adjustments to your machine's time to bring it into line with the network average.

You can configure your 4D to run timed after rebooting by doing:

    su
    /etc/chkconfig timed on
    exit

You can manually run timed by doing:

    su
    timed -M
    exit

This should help prevent the drift problem you are currently seeing. Unfortunately, if you only have one machine running timed, it won't do you much good to run timed. See the timed manual page for more information.

--- Ciemo
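
The "average, then adjust incrementally" idea described above can be sketched roughly as follows. This is not the actual timed protocol or implementation. The function names, the peer offsets, and the 0.5-second step limit are all hypothetical, chosen only to show the arithmetic.

```python
def average_offset(offsets):
    """Mean of the other hosts' clock offsets relative to ours, in seconds."""
    if not offsets:
        # With no other hosts reporting, there is nothing to average against.
        raise ValueError("need at least one other host to average against")
    return sum(offsets) / len(offsets)


def slew(error, max_step=0.5):
    """Yield small corrections that add up to `error`, none larger than max_step."""
    remaining = error
    while abs(remaining) > 1e-9:
        step = max(-max_step, min(max_step, remaining))
        remaining -= step
        yield step


# Hypothetical offsets of three peers relative to the local clock, in seconds.
peers = [1.8, 2.2, -0.4]
target = average_offset(peers)
steps = list(slew(target))
print(target, steps)
```

Applying the correction in small steps rather than one jump keeps the local clock from moving abruptly, which is the point of making incremental adjustments.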