rcj@burl.UUCP (R. Curtis Jackson) (04/25/84)
We are running USG 5.0 on a Vax 11/780. Every morning for several months, the same thing -- we come in and the 'fsck -n' executed during our filesystem backup tells us that we have a BAD FREE LIST or MISSING BLOCKS IN FREELIST on our root filesystem. If we use fsck to fix it and reboot, things are usually OK; sometimes we can't even go back to 'unix' [single-user] from stand-alone because the thing 'panic: trap's on us and we have to come up on a backup.

We have tried the following and more:

a) getting a good root [/unix] on /dev/rrp0, making a totally new filesystem on /dev/rrp20, volcopy/cpio [tried both] onto the new filesystem, and come up on that one. No dice.

b) reconfig and remake unix completely, put it on a virgin filesystem. Nothing.

c) I thought that maybe our /tmp filesystem was screwy and was causing things to be written incorrectly, so I run fsck on that daily. No problems there.

d) I've checked our swap setting in our tunables to make sure that it was not overwriting -- although we hardly ever swap anyway -- no problem.

e) etc. etc. etc.

Also, at odd intervals, unix decides that it has a duplicate block (so I have been told) in root's freelist and invalidates the entire freelist, leaving us instantly with 'Out of space' errors everywhere.

I have heard that one of the Denver ATTIS systems had a similar problem with their /usr filesystem; can anyone anywhere shed any light on any aspect of any of these problems?

Tomorrow night we sacrifice a goat and place its entrails on the CPU at the stroke of midnight......

--
The MAD Programmer -- 919-228-3313 (Cornet 291)
alias: Curtis Jackson ...![ ihnp4 ulysses cbosgd clyde ]!burl!rcj

alanr@drutx.UUCP (RobertsonAL) (04/27/84)
This is almost certainly due to a bug in the concurrency management in the UNIX free list handler. We experienced the same problem here for two months!! on 3-8 machines, involving PDP-11s running 3.0, and vaxen and 3B20s running System 5.

It is fixed in UNIX/370 (where concurrency is a BIG issue -- many users, lots of CPUs, and a non-single-threaded kernel), and (if I recall correctly) the problem went like this:

1) the superblock free list runs out, and the O/S begins rebuilding it, while
2) someone allocates, then frees, a block before 1) completes.

This fouls things up, since the free list is not locked for the duration of the rebuild. UNIX/370 fixed this by locking the free list for the duration of 1) above. This makes response time on the machine glitch occasionally while the free list gets rebuilt.

We tried the goat; it had absolutely no effect (except for a residual smell we can't seem to get rid of).

We NEVER FIXED THE PROBLEM -- it came by itself, then it went away by itself, without us ever being able to track it down to the exact logic that was causing it. And we did try mightily.

The UNIX/370 folks I talked to indicated that this same problem exists in EVERY version of UNIX since Version 7. You just get lucky almost all of the time, until you don't get lucky anymore.

If you want to file this problem with WECO, please call me, and we'll gladly substantiate your claim.

THIS OUGHT TO GET FIXED!!!

--
Alan Robertson
ihnp4!drutx!alanr
AT&T Information Systems Laboratories
Denver, Colorado
Room 31Y-27, x4796
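
For anyone trying to picture the race described above, here is a toy model in C. It is only a sketch under assumptions, not the kernel code: the in-core cache and the chained on-disk list follow the general Version 7 layout, but the struct names, the `locked` flag, the tiny sizes, and the functions `free_block`, `rebuild_finish` and `lost_blocks` are all made up for illustration, and the two "processes" are just steps run in a fixed order. It models only the "frees a block during the rebuild" half of step 2).

```c
#include <assert.h>
#include <string.h>

#define NICFREE 4    /* toy size of the in-core free-block cache */
#define NBLOCKS 16   /* toy disk size, in blocks */

/* One link of the on-disk free-list chain: a count plus block numbers. */
struct chain { int nfree; int free[NICFREE]; };

struct superblock {
    int nfree;           /* number of valid entries in free[] */
    int free[NICFREE];   /* cached free block numbers */
    int locked;          /* hypothetical lock held during a rebuild */
};

static struct chain disk[NBLOCKS];  /* disk[b] = chain data stored in block b */

/* Free a block. Returns 0 if the list is locked (a real caller would sleep). */
static int free_block(struct superblock *sb, int bno)
{
    if (sb->locked)
        return 0;
    if (sb->nfree == NICFREE) {
        /* Cache full: spill it into the freed block, which becomes a link. */
        disk[bno].nfree = sb->nfree;
        memcpy(disk[bno].free, sb->free, sizeof sb->free);
        sb->nfree = 0;
    }
    sb->free[sb->nfree++] = bno;
    return 1;
}

/* Finish a rebuild: copy the chain stored in block bno over the cache. */
static void rebuild_finish(struct superblock *sb, int bno)
{
    sb->nfree = disk[bno].nfree;
    memcpy(sb->free, disk[bno].free, sizeof sb->free);
    sb->locked = 0;
}

/* Run one fixed interleaving; return how many freed blocks went missing. */
int lost_blocks(int use_lock)
{
    struct superblock sb = { 1, { 5 }, 0 };  /* cache holds only block 5 */
    int bno, accepted, i, found = 0;

    memset(disk, 0, sizeof disk);
    disk[5].nfree = 2;                       /* block 5 lists blocks 6 and 7 */
    disk[5].free[0] = 6;
    disk[5].free[1] = 7;

    /* Process A takes the last cached block; the cache is now empty and
     * the rebuild begins (A would sleep here waiting on the disk read). */
    bno = sb.free[--sb.nfree];
    sb.locked = use_lock;

    /* Process B frees block 9 while A is still waiting. */
    accepted = free_block(&sb, 9);

    /* A wakes up and overwrites the cache with the chain from block 5. */
    rebuild_finish(&sb, bno);

    /* With the lock, B was turned away and retries after the rebuild. */
    if (!accepted)
        free_block(&sb, 9);

    for (i = 0; i < sb.nfree; i++)
        if (sb.free[i] == 9)
            found = 1;
    return found ? 0 : 1;
}
```

Calling `lost_blocks(0)` runs the interleaving with no lock: B's block 9 goes into the empty cache and is then wiped out when A's rebuild copies the chain over it, which is the kind of thing fsck would later report as a missing block. Calling `lost_blocks(1)` holds the toy lock for the whole rebuild, so B has to wait and block 9 survives. That wait is the price of the fix, the same occasional glitch in response time mentioned above.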