[net.bugs.4bsd] fsck problem

ronnie@mit-eddie.UUCP (Ronnie Schnell) (01/13/86)

I'm having trouble with my VAX.  There is a corrupted directory and
fsck says:

SALVAGE?

And no matter what I type, it just returns to the C shell and doesn't
seem to fix anything.



			#Ron (ronnie%sutcase.bitnet@wiscvm.wisc.edu)
			(ronnie@mit-eddie.uucp)

bzs@bu-cs.UUCP (Barry Shein) (01/16/86)

Re: fsck reports corrupted directory, asks to SALVAGE and then exits
on any answer.

I believe I have seen this behavior. Rather than attempt to fix fsck
for you, here is something that *might* work, but the risk is yours...

Safety First: Do you have a reasonable backup of the partition? If not,
you had better figure out how to make one now. I might try going
single user, mounting the file system (maybe read-only) and seeing what
appears to be there. Note that mounting it could cause a crash and
more problems, so you might want to back up root first (tho if you
don't touch any files on root while you do this, and sync root *before*
mounting, everything should be ok even if you do crash.)

Dump often does not work on an addled file system (I wouldn't even try,
as the dump tape created might be unusable even if dump thinks it's ok.)
Tar is a possibility, perhaps avoiding that addled directory (i.e. tar
the directories around it.) Another possibility is to grab another,
healthier disk area and try moving things over there. Yet another
is to say 'Hey, we did a backup last night, if I lose it I lose it...'

A last resort is to just dd the entire raw device to tape; at least
you could then start all over again (tho hours could be wasted.) Also,
if desperation sets in, you could try to recover files from the dd tape
later (I don't envy this, but some straightforward wizardry might make
it do-able. A friend once wrote a tiny shell-like thing which did
'ls', 'cd', 'cp' and 'cat' using a raw device as input; that was V6 tho
[why he wrote it I leave to your imagination.])

Now, having planned the backout, I would try to locate the
addled directory, move out of it what I can, delete the directory in
question, unmount and re-run fsck. In desperation you could clri the
directory's inode. Chances are good fsck will then figure things out,
EXCEPT if you are having hardware troubles (a bad block in the
directory area), tho I think fsck would report this (maybe.)

Something that has happened to me and is very unnerving is a situation
where the structure seemed empty!! After a little playing around and
much brain-games with fsck we hadn't lost a thing other than my sanity
(well, a few files landed in lost+found which of course didn't exist
till I kinda forced my way with that disk.)

I would also let any suggestions like this congeal a little to see if
someone finds a flaw or a much better way. On the other hand, your users
may not be terribly patient. With the hardened file system on 4.2, you
are probably going to have to seriously consider a hardware problem.
That is often what it comes down to, tho it could have been an
unfortunately timed power hit.

Good luck. It also might be a good day to announce that you are paying
for next year's xmas party...:-)

	-Barry Shein, Boston University

(P.S. I guess I posted this to see if others find a major flaw.
Also, since you are right across the river, feel free to call me if
things get too hairy; the price is your first-born.)

(P.P.S. Glancing at that area of fsck.c there seem to be a few
conditions which could cause this to happen, like a calculation in
fsck_readdir yielding a zero (NULL) to dirscan and the filesize at
that point appearing to be zero tho it's hard to check.)

wje@daisy.UUCP (William J. Earl) (01/17/86)

>>Re: fsck reports corrupted directory, asks to SALVAGE and then exits
>>on any answer.
>
>I believe I have seen this behavior, rather than attempt to fix fsck
>for you the following *might* work, but the risk is yours...
>
> ...
>
>(P.P.S. Glancing at that area of fsck.c there seem to be a few
>conditions which could cause this to happen, like a calculation in
>fsck_readdir yielding a zero (NULL) to dirscan and the filesize at
>that point appearing to be zero tho it's hard to check.)

    At our site, we fixed a bug in dirscan().  Just inside the for loop
which calls fsck_readdir(), we added:

	if (dp->d_reclen > DIRBLKSIZ) /* force it to be <= DIRBLKSIZ */
		dp->d_reclen = DIRBLKSIZ;

(This is added just before the statement "dsize = dp->d_reclen;".)
Without the check, fsck failed while trying to salvage directory
entries from a corrupted directory block, even though it had already
recognized the block as corrupted.  We did not have time to work on
it further, but there are probably other cases where certain kinds
of garbage cause fsck to fail because it is not suspicious enough.
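
For anyone who wants to see the idea in isolation, here is a minimal
sketch of the same kind of check.  It is not the fsck source: the
struct rec_hdr layout and the scan_dirblk() name are made up for
illustration, and the two extra guards (clamping to the space left in
the block rather than to DIRBLKSIZ, and stopping on a record too short
to hold its own header) are assumptions about what "suspicious enough"
might look like, not something we tested in fsck itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRBLKSIZ 512	/* directory block size used in this sketch */

/* Hypothetical, simplified header of one directory record. */
struct rec_hdr {
	uint32_t d_ino;		/* inode number, 0 means unused slot */
	uint16_t d_reclen;	/* length of this record in bytes */
	uint16_t d_namlen;	/* length of the name that follows */
};

/*
 * Walk one directory block and count the in-use records.
 * Never trusts d_reclen: an oversized value is clamped so the scan
 * stays inside the block, and an undersized value ends the scan
 * instead of looping forever or reading a header at a bad offset.
 */
int
scan_dirblk(const unsigned char *blk)
{
	size_t off = 0;
	int count = 0;

	while (off + sizeof(struct rec_hdr) <= DIRBLKSIZ) {
		struct rec_hdr h;
		size_t remaining = DIRBLKSIZ - off;

		/* Copy the header out; blk + off may be unaligned. */
		memcpy(&h, blk + off, sizeof h);

		if (h.d_reclen > remaining)	/* force it to fit */
			h.d_reclen = (uint16_t)remaining;
		if (h.d_reclen < sizeof h)	/* cannot advance safely */
			break;

		if (h.d_ino != 0)
			count++;
		off += h.d_reclen;
	}
	return count;
}
```

With a block whose second record claims a length of 0xFFFF, the scan
treats that record as running to the end of the block and stops
there, rather than walking off into whatever follows in memory.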

-- 
	William J. Earl
	Daisy Systems Corporation, Mountain View, CA