jpl@allegra.att.com (02/28/89)
We ran into a similar problem with the 4.3 fsck.  The basic problem
was this.  fsck waited for parallel fscks to complete using

    if (preen) {
        union wait status;
        while (wait(&status) != -1)
            sumstatus |= status.w_retcode;
    }

However, if the process terminated abnormally, retcode was 0, so fsck
failed to detect the error.  We changed the code to be

    if (preen) {
        union wait status;
        while (wait(&status) != -1) {
            if (status.w_termsig) {
                printf("child died with signal %d during pass %d\n",
                    status.w_termsig, passno);
                sumstatus |= 8;
            } else
                sumstatus |= status.w_retcode;
        }
    }

This treats abnormal termination (MUCH more serious than a bit of
file system corruption) as an error as well.

How could a process terminate abnormally, you might ask?  There's a
line in pass1 that looks like

    ndb = howmany(dp->di_size, sblock.fs_bsize);

ndb (the number of data blocks) is subsequently used as an array
index.  But we found that with suitably huge di_size, howmany could
make ndb go negative, so the array reference caused a dump.  We
cleaned that one up by adding the check...

    if (ndb < 0) {
        if (debug)
            printf("bad size %d ndb %d:", dp->di_size, ndb);
        goto unknown;
    }

Until we put in these fixes, we had a file system that would make
fsck drop core, but fsck -p didn't notice it, so the condition
persisted for weeks.  We finally caught the problem when we ran a
fsck without the -p, and noticed that it died on that file system.

John P. Linderman  Department of Bounced fsck's  allegra!jpl
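
For anyone wondering how a huge di_size turns into a negative ndb, here is a small sketch of the arithmetic. It assumes 32-bit signed sizes and the usual <sys/param.h> definition of howmany, and it assumes the culprit is the addition inside the macro wrapping around; the post does not spell out the mechanism. The names ndb_wrapped and ndb_checked are made up for illustration and are not part of fsck. The wraparound is simulated in unsigned arithmetic, because real signed overflow is undefined in C.

```c
#include <stdint.h>

/*
 * The usual BSD definition, shown here for reference only:
 *     #define howmany(x, y)  (((x) + ((y) - 1)) / (y))
 */

/*
 * Hypothetical helper: the value a 32-bit signed howmany(size, bsize)
 * would produce if the addition wrapped around.  The sum is formed in
 * unsigned arithmetic (where wraparound is well defined) and then
 * reinterpreted as a two's complement signed value before dividing.
 */
int32_t ndb_wrapped(int32_t size, int32_t bsize)
{
    uint32_t sum = (uint32_t)size + (uint32_t)(bsize - 1);
    int64_t as_signed = (sum <= (uint32_t)INT32_MAX)
        ? (int64_t)sum
        : (int64_t)sum - 4294967296LL;   /* subtract 2^32 */

    return (int32_t)(as_signed / bsize);
}

/*
 * Hypothetical overflow-free variant: divide first, then round up.
 * Returns -1 for a negative size or a non-positive block size, so the
 * caller can reject the inode, much like the "ndb < 0" check above.
 */
int32_t ndb_checked(int32_t size, int32_t bsize)
{
    if (size < 0 || bsize <= 0)
        return -1;
    return size / bsize + (size % bsize != 0);
}
```

With an 8192-byte block size, a size of 2147483647 makes ndb_wrapped come out negative, while ndb_checked returns 262144 for the same input. The "ndb < 0" test in the post catches the first case after the fact; dividing before rounding avoids the wrap altogether.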