alex@umbc3.UMD.EDU (Alex S. Crain) (07/06/88)
This is my experience with major disaster recovery. It's kind of long.

Well, I knew it was coming, but as usual I was totally unprepared for what happened. The scenario...

I'm debugging a mail filter with adb. At some point adb halts on SIGALRM. I continue, but adb is unhappy and confused. I try to restart the program, adb breaks at a weird place, I continue, adb dumps core, and the kernel panics with

    panic:iaddress > 2^24

Now, I'm smart enough to figure that this is bad, so I'm not too surprised when the reboot dies with the same message.

Well, to make a long story short, 6 hours later things are back to normal. My best guess is that the kernel threw its guts up onto some random place in the file system, probably over things like the freelist, a few directories, etc. As for *why* it did that, I have no clue. I've been playing with some custom syscalls, so it could be either a vicious kernel bug or a vicious device driver bug. I'm not willing to continue testing to find out :-).

I'm not posting to complain, but rather to share some experience. I was faced with the problem of fscking the hard disk and reloading some of the software without reformatting. What I should have done was....

1) boot floppy unix. Floppy unix lives on disk 2 of the foundation set, and uses a floppy mounted file system. You can get a root shell by first saying that you do not want to save your files, but then changing your mind. The conversation goes like...

    Do you want to save any existing user files? no
    This action will destroy any data on the hard disk.
    do you want to continue? no
    #

You can also substitute your own custom file system for disk 3 of the foundation set. Just mount disk 3, cpio everything off, and make a new mounted file system with your changes. Don't forget /lib/shlib. It has to be there so that it can get loaded. Once it's loaded, it can go away.

UNDOCUMENTED FEATURE #1: /unix doesn't have to use /etc/init.
If /etc/init is not present, /unix runs the shell script /etc/profile in single user mode. If this script contains the line 'exec sh', you get a single user shell.

So I should have had a custom floppy file system that would mount the hard disk and give me a root shell.

    (/etc/profile => mount /dev/fp002 /mnt; exec sh)

2) fsck the hard disk. fsck won't let you fsck a mounted file system.

UNDOCUMENTED FEATURE #2: fsck will let you fsck a mounted file system through its *raw* device. fsck isn't on the floppy, so we run the copy on the hard disk:

    # /mnt/etc/fsck /dev/rfp002

3) delete init files on the hard disk.

    # rm /mnt/etc/init
    # rm /mnt/etc/rc

4) make a startup file on the hard disk.

    # cat > /mnt/etc/profile
    exec sh
    ^d

5) boot hard disk from a floppy. This is disk 4 of the foundation set. It is just like /unix, but lives on a floppy, so you can boot from the hard drive even if /unix is corrupt. To do this, do

    # sync;sync;sync;
    # /etc/reboot "a message"
    "a message"
    [Hit any key to continue]

The message string gets reboot to issue the [Hit any key...] message and wait till you put disk 4 in the drive.

6) now unix is running with a single user shell. Disks 5-12 contain a big cpio file of the foundation set. If you ever want to make a new foundation set, make a big cpio file of /, /bin, /dev, /etc, etc., and use that when the system asks for disk 5. Anyway, put the foundation set in /tmp, and then recover whatever files you might need from the root shell. You will need good copies of /etc/init, /etc/inittab, /etc/rc, and whatever is referenced from rc. Use /bin/sum to check file validity (the sums should match between new & old copies).

That's roughly how disaster recovery *could* go. It *did* go vaguely like that, with a lot of redundancy and many more fscks, since I found out all of this by trial & error.

I hope this helps someone avoid some pain someday.

                                        -- :alex.

Systems Programmer                      nerwin!alex@umbc3.umd.edu
UMBC                                    alex@umbc3.umd.edu
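
P.S. The /bin/sum check on recovered files can be scripted rather than eyeballed. What follows is only a sketch under assumptions: compare_sums is a hypothetical helper, not something shipped with the system, and it is written for a modern POSIX sh (the $( ) syntax may need to become backquotes on an older shell). It assumes the fresh copies were extracted under one directory and the possibly damaged copies live under another, with the same relative paths in both.

```shell
#!/bin/sh
# Hypothetical helper (not part of the foundation set): compare the
# checksums of files in a freshly extracted tree against the same
# files in the tree on the hard disk.
# usage: compare_sums FRESH_ROOT DISK_ROOT FILE...
compare_sums() {
    fresh_root=$1
    disk_root=$2
    shift 2
    status=0
    for f in "$@"; do
        # Both copies must exist before their sums can be compared.
        if [ ! -f "$fresh_root/$f" ] || [ ! -f "$disk_root/$f" ]; then
            echo "MISSING $f"
            status=1
            continue
        fi
        # Feed each file on stdin so sum prints no file name,
        # only the checksum and block count.
        fresh_sum=$(sum < "$fresh_root/$f")
        disk_sum=$(sum < "$disk_root/$f")
        if [ "$fresh_sum" = "$disk_sum" ]; then
            echo "OK $f"
        else
            echo "DIFFERS $f"
            status=1
        fi
    done
    # Non-zero if any file was missing or had a different sum.
    return $status
}
```

With the foundation set unpacked in /tmp, something like `compare_sums /tmp / etc/init etc/inittab etc/rc` would list which of those files need to be copied back. A matching sum is a sanity check rather than proof: sum is a short checksum, so two different files can occasionally agree.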