[comp.sys.sgi] fsck

W0L@PSUVM.BITNET (Bill Lasher) (11/10/89)

Some of you may have been following the fsck question I posted last week.
Thanks to help from several of you, including some people at SGI, I finally
decided the REAL problem was our system administration.  One of the people at
SGI thought the following might be of interest to others, and suggested I post
it.

The original note follows:
=========================================================================
Date:    9 November 1989, 14:16:04 EST
From:    Bill Lasher                (814) 898-6391  W0L      at PSUVM
Subject: Re: fsck, init state 3
To:      dunlap at sgi.sgi.com
In-Reply-To:  dunlap%bigboote.csd AT sgi.com   -- Thu, 9 Nov 89 11:06:55 PST

Our most recent problem (the RPC timeout) I think was caused by the way
we implemented the nightly reboot.  We scheduled them 5 minutes apart,
figuring that would be enough time.  I found out today that one machine
was still in the process of restarting when the YP server he was
communicating with started to reboot.  This caused the system to hang.
Rebooting did in fact clear things up, but it took some time.  Part of
the problem is that the time on each machine is not exactly the same (a
difference of a couple of minutes).  We are going to set all machines to
the same time, and change the reboot interval to 10 minutes.
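
The arithmetic behind that change can be sketched as below.  This is only
an illustration: the 4-minute boot time is an assumed figure (the note does
not say how long a restart took), and the function names are made up for
the example.

```python
# Rough check of whether staggered reboots can collide.
# Assumptions (not from the original note): boot_min is a guessed restart
# duration, and skew_min is the worst-case clock difference between two
# machines.  All values are in minutes.

def effective_gap(interval_min, skew_min):
    """Real time between two scheduled reboots in the worst case.

    If the second machine's clock runs ahead by skew_min, its reboot
    fires that much earlier than the schedule suggests.
    """
    return interval_min - skew_min


def still_booting(interval_min, boot_min, skew_min):
    """True if the first machine may still be restarting when the next
    machine (for example its YP server) begins its own reboot."""
    return boot_min > effective_gap(interval_min, skew_min)


# Old schedule: 5 minutes apart, clocks off by about 2 minutes.
old_schedule_collides = still_booting(interval_min=5, boot_min=4, skew_min=2)

# New schedule: 10 minutes apart with synchronized clocks.
new_schedule_collides = still_booting(interval_min=10, boot_min=4, skew_min=0)
```

With these assumed numbers the old schedule leaves only 3 real minutes
between reboots, which is less than the restart takes, while the new one
leaves the full 10.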

I think we got thrown off the track because running fsck nightly changed
the total time it took for the systems to reboot, and things just
happened to work out O.K.  Also, we probably weren't patient enough
earlier to let reboot do its thing; when reboot didn't work, we tried
fsck, which did work because it took longer to finish up, and by the
time it was done the network wasn't as busy (or something like that.)  I
think we were also in a hurry to get things fixed, and as a result got
sloppy (i.e., running fsck without unmounting, etc.).

Some of our problems may come back, but we will handle each of them
separately as they occur, and try to be more careful.  I suspect some of
the earlier problems (the full disks, hung spool queues) showed up
because we were letting the systems run for a week at a time without
rebooting, and things just got a little messy.  We had planned from the
beginning to have them reboot every night, but we had too many other
things going on to get it implemented.

We'll just take it from here and see what happens.

Best regards,

Bill
========================================================================
END OF ORIGINAL NOTE

You may not follow all the details, but you probably get the general idea.
I think it's a good example of what can happen when an experienced computer
user gets his first UNIX/networked system.

Bill

"If I knew what I was doing, I wouldn't have had to ask the question!"

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (11/10/89)

In article <89313.153421W0L@PSUVM.BITNET>, W0L@PSUVM.BITNET (Bill Lasher) writes:
> ... Part of
> the problem is that the time on each machine is not exactly the same (a
> difference of a couple of minutes).  We are going to set all machines to
> the same time...

Timed(1m) or timeslave(1m) can be used to synchronize time on the
machines.  In a simple network (i.e. without gateways separating machines),
just turning it on should be enough to keep time synchronized to within
75 msec.

The next release after 3.2 will have improvements to the daemons and the
kernel which should allow timed or timeslave to keep time 10 or more times
tighter.

Time-skew of seconds, not to mention minutes, is a royal pain for make(1)
combined with NFS.
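
A simplified model of why skew hurts make: make rebuilds a target only when
a prerequisite's modification time is newer than the target's.  The sketch
below assumes each file is stamped by whichever machine's clock wrote it;
the `stamp` and `needs_rebuild` helpers and all the numbers are
hypothetical, chosen only to show the failure.

```python
# Simplified model of make's timestamp test when clocks disagree.
# Assumption: each file's mtime comes from the clock of the machine that
# stamped it, and that clock may be offset from true time.

def stamp(true_time, clock_offset):
    """Modification time as recorded by a clock that is off by clock_offset
    seconds (positive means the clock runs fast)."""
    return true_time + clock_offset


def needs_rebuild(source_mtime, target_mtime):
    """make's basic rule: rebuild when the source is newer than the target."""
    return source_mtime > target_mtime


# Object file built at true time 1000 and stamped by a clock 120 s fast.
target_mtime = stamp(1000, 120)

# Source edited a minute later (true time 1060), stamped by an accurate clock.
source_mtime = stamp(1060, 0)

# The source really is newer, but the recorded times say otherwise,
# so make would skip the rebuild and leave a stale object file.
skipped_by_skew = not needs_rebuild(source_mtime, target_mtime)

# With no skew the same edit triggers a rebuild as expected.
rebuilt_without_skew = needs_rebuild(stamp(1060, 0), stamp(1000, 0))
```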

It should be noted that the 4.3 BSD release of timed left something to be
desired in both implementation and protocol.  The implementation in the
newer "network" release may be better, although the protocol, having been
released, is unfixable.  The IRIX version is heavily hacked from the 4.3
release, and seems to be serviceable.  It's been about a year since time went
crazy among the thousands of IRIS's in the SGI network for reasons other
than operator error.


Vernon Schryver
Silicon Graphics
vjs@sgi.com

ciemo@bananaPC.sgi.com (Dave Ciemiewicz) (11/10/89)

In article <89313.153421W0L@PSUVM.BITNET>, W0L@PSUVM.BITNET (Bill
Lasher) writes:

> Our most recent problem (the RPC timeout) I think was caused by the way
> we implemented the nightly reboot.  We scheduled them 5 minutes apart,
> figuring that would be enough time.  I found out today that one machine
> was still in the process of restarting when the YP server he was
> communicating with started to reboot.  This caused the system to hang.
> Rebooting did in fact clear things up, but it took some time.  Part of
> the problem is that the time on each machine is not exactly the same (a
> difference of a couple of minutes).  We are going to set all machines to
> the same time, and change the reboot interval to 10 minutes.
>
You might consider running timed on all of your machines which have it.
Timed will average the network time of all other machines on your local
area network which are running timed.  Timed will then make incremental
time adjustments to your machine's time to bring it into line with the
network average.
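
The idea of averaging and then adjusting in small steps can be sketched
roughly as follows.  This is not the actual timed protocol or its
algorithm, just an illustration of the two steps described above; the host
names, clock values, and the 5-second step limit are all made up.

```python
# Illustration only: average the clocks on a network, then nudge one
# machine toward that average in bounded steps instead of jumping.
# This is NOT how timed is implemented; names and numbers are hypothetical.

def network_average(clocks):
    """Mean of the clock readings (seconds) reported by each host."""
    return sum(clocks.values()) / len(clocks)


def slew_step(local, target, max_step):
    """Move the local clock toward target by at most max_step seconds."""
    delta = target - local
    step = max(-max_step, min(max_step, delta))
    return local + step


# Three hypothetical hosts whose clocks disagree by up to a minute.
clocks = {"iris1": 100.0, "iris2": 130.0, "iris3": 70.0}
average = network_average(clocks)

# iris3 is 30 s behind the average; repeated small steps close the gap.
reading = clocks["iris3"]
steps = 0
while reading != average:
    reading = slew_step(reading, average, max_step=5.0)
    steps += 1
```

Stepping gradually rather than setting the clock outright avoids sudden
jumps in the time seen by running programs.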

You can configure your 4D to run timed after rebooting by doing:

	su
	/etc/chkconfig timed on
	exit

You can manually run timed by doing:

	su
	timed -M
	exit

This should help prevent the drift problem you are currently seeing.
Unfortunately, if only one of your machines is running timed, it won't
do you much good to run it.
 
See the timed manual page for more information.

							--- Ciemo