[net.unix-wizards] Root filesystem bad free list problem -- HELP!! {whimper}

rcj@burl.UUCP (R. Curtis Jackson) (04/25/84)

We are running USG 5.0 on a Vax 11/780.  Every morning for several
months, the same thing -- we come in and the 'fsck -n' executed
during our filesystem backup tells us that we have a BAD FREE LIST
or MISSING BLOCKS IN FREELIST on our root filesystem.  If we use
fsck to fix it and reboot, things are usually OK; sometimes we
can't even go back to 'unix' [single-user] from stand-alone because
the thing 'panic: trap's on us and we have to come up on a backup.

We have tried the following and more:
a) getting a good root [/unix] on /dev/rrp0, making a totally new
filesystem on /dev/rrp20, using volcopy/cpio [tried both] to copy onto
the new filesystem, and coming up on that one.  No dice.
b) reconfiguring and remaking unix completely, and putting it on a
virgin filesystem.  Nothing.
c) I thought that maybe our /tmp filesystem was screwy and was causing
things to be written incorrectly, so I run fsck on that daily.  No
problems there.
d) I've checked our swap setting in our tunables to make sure that it
was not overwriting -- although we hardly ever swap anyway -- no problem.
e) etc. etc. etc.

Also, at odd intervals, unix decides that it has a duplicate block
(so I have been told) in root's freelist and invalidates the entire
freelist, leaving us instantly with 'Out of space' errors everywhere.

I have heard that one of the Denver ATTIS systems had a similar problem
with their /usr filesystem; can anyone anywhere shed any light on any
aspect of any of these problems?

Tomorrow night we sacrifice a goat and place its entrails on the CPU
at the stroke of midnight......
-- 

The MAD Programmer -- 919-228-3313 (Cornet 291)
alias: Curtis Jackson	...![ ihnp4 ulysses cbosgd clyde ]!burl!rcj

alanr@drutx.UUCP (RobertsonAL) (04/27/84)

This is almost certainly due to a bug in the concurrency management
in the UNIX free list handler.
We experienced the same problem here for two months!! on 3-8 machines
involving PDP-11's running 3.0, vaxen and 3b-20's running System 5.
It is fixed in UNIX/370 (where concurrency is a BIG issue -- many
users, lots of CPUs, and a kernel that is not single-threaded).

If I recall correctly, the problem went like this:
	
	1)	superblock free list runs out, and it begins getting
		rebuilt by the O/S
	while
	2)	someone allocates, then frees a block before 1) completes.

	This fouls things up, since the free list is not locked
	for the duration of the rebuild.

	UNIX/370 fixed this by locking the free list for the duration
		of 1) above.  This makes response time on the machine
		occasionally glitch, while the free list gets rebuilt.
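
The interleaving described above can be sketched as a small user-space
C model.  This is only an illustration under stated assumptions, not
kernel source: the struct, the function names, and the NICFREE value
are all hypothetical, the rebuild is split into two calls to stand in
for the time spent waiting on disk, and only the "frees a block" half
of step 2 is modelled.

```c
#include <assert.h>
#include <string.h>

#define NICFREE 8           /* in-core cache size; value chosen for illustration */

/* Hypothetical stand-in for the free-list part of a superblock. */
struct freelist {
    int nfree;              /* number of valid entries in cache[] */
    int cache[NICFREE];     /* in-core cache of free block numbers */
    int rebuilding;         /* set between begin_rebuild() and finish_rebuild() */
    int use_lock;           /* 1 = hold off other callers during a rebuild */
};

/* Step 1 starts: the cache has run dry and a refill is requested.
 * In a kernel the caller would now sleep waiting for disk I/O. */
static void begin_rebuild(struct freelist *fl)
{
    assert(fl->nfree == 0);
    fl->rebuilding = 1;
}

/* Step 1 ends: the batch read from disk replaces the cache contents,
 * overwriting anything that was pushed in the meantime. */
static void finish_rebuild(struct freelist *fl, const int *batch, int n)
{
    assert(n >= 0 && n <= NICFREE);
    memcpy(fl->cache, batch, (size_t)n * sizeof batch[0]);
    fl->nfree = n;
    fl->rebuilding = 0;
}

/* Step 2: another process frees a block.  Returns 0 on success, or -1
 * if the list is locked and the caller must wait and retry. */
static int free_block(struct freelist *fl, int blkno)
{
    if (fl->use_lock && fl->rebuilding)
        return -1;
    assert(fl->nfree < NICFREE);    /* the model never fills the cache */
    fl->cache[fl->nfree++] = blkno;
    return 0;
}

/* Returns 1 if blkno is currently in the in-core cache, else 0. */
static int on_list(const struct freelist *fl, int blkno)
{
    int i;

    for (i = 0; i < fl->nfree; i++)
        if (fl->cache[i] == blkno)
            return 1;
    return 0;
}

/* Plays out "free arrives in the middle of a rebuild".
 * Returns 1 if block 99 is still on the free list afterwards. */
int run_scenario(int use_lock)
{
    static const int batch[] = { 10, 11, 12 };  /* made-up block numbers */
    struct freelist fl;
    int deferred;

    memset(&fl, 0, sizeof fl);
    fl.use_lock = use_lock;

    begin_rebuild(&fl);
    deferred = (free_block(&fl, 99) != 0);  /* arrives mid-rebuild */
    finish_rebuild(&fl, batch, 3);
    if (deferred)
        free_block(&fl, 99);                /* retry once the lock is gone */

    return on_list(&fl, 99);
}
```

In this model run_scenario(0) returns 0: the freed block is overwritten
when the rebuild finishes, so it ends up on no list at all, which is the
sort of thing fsck would report as missing blocks.  run_scenario(1)
returns 1 because the free is held off until the rebuild completes, at
the cost of making that caller wait.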

We tried the goat; it had absolutely no effect (except for a residual
smell we can't seem to get rid of).

We NEVER FIXED THE PROBLEM -- It came by itself, then it went away by
itself, without us ever being able to track it down to the exact logic
that was causing the problem.  And we did try mightily.
The UNIX/370 folks I talked to indicated that this same problem exists
in EVERY version of UNIX since Version 7.

You just get lucky almost all of the time, until you don't get lucky
anymore.  If you want to file this problem with WECO, please call me,
and we'll gladly substantiate your claim.  THIS OUGHT TO GET FIXED!!!

	-- Alan Robertson
	   ihnp4!drutx!alanr
	   AT&T Information Systems Laboratories
	   Denver, Colorado
	   Room 31Y-27, x4796