canoaf@ntvax.UUCP (Augustine Cano) (06/09/89)
Hi everyone!

The problem: programs that used to work stopped working. At first I attributed the problem to the latest (as far as I know) version of C-Kermit, which I got from Columbia U. hoping that the major problems would be solved. Well, no luck. Kermit (on a 3b1) still does not exit without help from ^C and, most disturbing, when the following was executed from a "take" file, it locked up so badly that the only way out was to kill its parent shell from another window. The work-around that used to work in the previous version did not work anymore.

    set modem att
    ! phtoggle
    set line /dev/ph0
    set speed 1200
    dial nnn-nnnn
    connect

Well now, after rebuilding the system, it does not lock up anymore; it just does not set the modem properly. The funny thing is that the exact same sequence works fine when typed at the prompt. Am I overlooking something? Is anybody having the same problem? Is anybody using kermit on a UNIX PC? I believe that this is a genuine kermit problem. Am I wrong?

Vtem (anyone out there using vtem, the VT100 emulator that uses the pty's?) also acted very strangely. I compiled it under install, and when I ran it while logged in as install, it worked fine. Under another login it would lock up; not even echo appeared on the screen. The only solution was to kill its parent shell, just like kermit. Vtem also sometimes seemed to map the British character set, since '#' appeared as the pound-sterling symbol. This was solved by logging out.

Finally, one day, Lenny's sysinfo program stopped working after a brief power outage. At his suggestion, I checked whether the lipc driver was loaded and, sure enough, it wasn't. It turned out that quite a few files were in inconsistent states. Many libraries were different from the distribution ones, notably libc.a. This is probably the result of trying to have shcc and ccc installed at the same time.
At least a couple of files related to loadable drivers were also different, and many ua files were inconsistent (2 entries for the same package in installed software, a pty entry remaining after it was removed, etc.). The sysinfo problem was caused by the fact that loadavgd, when trying to start, would dump core in /etc/lddrv (someone mentioned this symptom some time ago) and therefore sysinfo could not communicate with it.

Rather than fix up individual files and risk missing some, I decided that it was time for a major overhaul. Not only will I end up with a guaranteed consistent system, I thought, but the HD fragmentation would go way down. The fragmentation did go down (from about 16.00 % to under 2 %) but I don't want to have to do this again, EVER! (Unless of course someone comes up with an automated script to do it.)

At first, I thought: no big deal! Just back up the whole thing, boot from floppy and restore everything. After 10 seconds though, it became obvious that you either restore everything unconditionally, putting back the corrupted files where they were, or you have to reconfigure the system manually when you're done. This would mean re-creating the groups, users, links and configuration from scratch, as well as finding out about each and every file that didn't come in a distribution set.

The solution:

1 - Remove all installed packages from install (it is here that I found out about the inconsistent ua files.)

2 - Back up /u (all users):

    find /u -print | cpio -oBcv > /dev/rfp021

3 - MAKE SURE THAT THE CPIO SET IS READABLE:

    cpio -ictB < /dev/rfp021

for this and all future cpio sets. This might save you a lot of grief. When I first tried to restore the whole HD (a cpio set of 90+ floppies), cpio just quit at disk 74. After trying a second time and failing at disk 71, I was getting pretty paranoid about losing irreplaceable data. It turned out that disks 71-91 were Kodak HD600 (96 TPI).
I would have thought that better disks would have no problem at lower track densities. Is there something inherent in the magnetism of the (thinner?) magnetic coating or the sensitivity of a 48 TPI drive head that makes the use of such floppies a hazard to your data? Or did I just hit a few bad disks? In any case, using fc, I could copy the data from the 96 TPI disks (sometimes after many tries) to regular 48 TPI floppies. From then on there was no problem.

4 - Make one cpio set for each directory that does not exist in the distribution; in my case /usr/man, /usr/lbin, /usr/src, /usr/doc, /usr/local, /usr/games.

5 - Log in as root.

6 - Delete all /u files:

    rm -r /u

(I felt really funny doing this...)

7 - Delete all the directories backed up in step 4.

8 - Run:

    find / -newer /bin/cat -print > /tmp/modified.files

This will make a list of all the remaining files that have been modified since the installation of the foundation set.

9 - Print this file and go through it, deleting any files that you know are in a package that is backed up on floppy. These files would still be there because they were not removed in step 1, probably because they came from a non-installable package.

10 - Make a separate cpio set for each directory remaining on that list. In my case: /bin, /etc, /lib, /usr/bin, /usr/lib, /usr/mail, /usr/spool. Mark these clearly to the effect that they will have to be reviewed before restoring.

11 - Reboot floppy unix and install a clean foundation set. When asked if you want to wipe out the files on the HD, say yes. (How often do you get to willingly destroy everything on your HD? :-)

12 - Log in as install and install the appropriate installable packages in the appropriate order: i.e. Telephone, ATE, Curses/Terminfo end user package, GSS Drivers, Dev. set, Enhanced editors, Encryption set (the order of this one is important), etc.

13 - Log in as root and restore the cpio sets made in step 4:

    cpio -iBdcv < /dev/rfp021

for each of /usr/man, /usr/doc, /usr/local, etc.
The idea of restoring these before /u is that, since these files are modified less often than user files, they will stay packed and unfragmented closer to the beginning of the disk longer. Is this reasoning correct?

14 - Make whatever links you had that were not standard: i.e. ln /bin/as /bin/mas, ln /bin/cc /bin/mcc, ln /usr/bin/compress /usr/bin/zcat, etc.

15 - cd /tmp

16 - One by one, restore the directories saved in step 10, REDIRECTING to the current directory:

    cpio -iBdcvR < /dev/rfp021

17 - For each of the directories in step 10, do:

    diff -r <name-of-directory> /<name-of-directory> > <name-of-directory>.diff

This will give you a list of which files were present in your old directory and not in the clean one (these you want to copy to the new), which files are in the new and not in the old (ignore these), and which files are in both but differ.

18 - For each of the directories in step 10, edit the file /tmp/<name-of-directory>.diff. Delete the lines: "Only in /<name-of-directory>". Copy the files on the lines: "Only in <name-of-directory>" to the new one (/<name-of-directory>). For those that exist in both the old and the new, you'll have to decide whether to copy them or not. Unless you know what the file is for, and you're sure you want the old version, don't copy it. It is better to have to do some minor configuration later on than to have a still-corrupt system. In the case of /etc, the only files I copied from the old directory (now /tmp/etc) were /etc/daemons/*, /etc/group and /etc/passwd. It was in this step that I found out how many corrupted or inconsistent files I really had.

19 - After finishing each of the directories in step 10, clean up /tmp. This will reduce external fragmentation (I think.)

20 - Install those packages that were in /usr/src.

21 - Do an unconditional restore of /u:

    cpio -iBdcvu < /dev/rfp021

Before doing this I saved /u as it was laid out by the foundation set in a /tmp file and then applied step 17 to that file.
This, however, is not necessary, since no user files were modified during installation.

22 - Reboot the system. YOU'RE DONE !!!

One very minor problem is that links cannot be made across cpio sets. Cpio could not recreate the link of one /usr/src file to a /u file, since /u was not on the same set.

The only re-configuration I had to do was to set up the printer, the phone line and the screen blanking interval. This was done in 5 minutes and could have been avoided had I restored the files where this information is kept.

Well, I hope this helps someone with a similar problem. Of course, if somebody decides to automate this procedure by putting it into a script, I would definitely like to see it. If someone has other ideas or comments on how this process could be simplified, I would also like to hear them.

Augustine Cano
canoaf@dept.csci.unt.edu
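
As a starting point for anyone tempted to automate part of this, the comparison in steps 17 and 18 could be sketched roughly as follows. This is only a sketch under stated assumptions, not something tested on a 3b1: classify_tree is a made-up helper name, it walks the old tree with find and cmp rather than parsing diff -r output, and it assumes a POSIX shell (the stock UNIX PC Bourne shell may lack the ${var#pattern} expansion used here).

```shell
#!/bin/sh
# Hypothetical helper (not part of any distribution): compare an old tree
# restored under /tmp with the clean tree laid down by the foundation set.
#
# Usage: classify_tree OLD_DIR NEW_DIR
# Prints one line per regular file found under OLD_DIR:
#   ONLY_OLD  <relative path>   missing from the clean tree (candidate to copy)
#   DIFFERENT <relative path>   present in both, contents differ (review by hand)
# Files that exist only in NEW_DIR are ignored, as in step 18.
# Assumption: file names contain no newlines.
classify_tree() {
    old=$1
    new=$2
    # List the old tree relative to its root, in a stable order.
    ( cd "$old" && find . -type f -print ) | sort | while IFS= read -r rel; do
        rel=${rel#./}
        if [ ! -e "$new/$rel" ]; then
            printf 'ONLY_OLD %s\n' "$rel"
        elif ! cmp -s "$old/$rel" "$new/$rel"; then
            printf 'DIFFERENT %s\n' "$rel"
        fi
    done
}
```

For example, with the old /etc restored under /tmp/etc, something like `classify_tree /tmp/etc /etc > /tmp/etc.review` would produce a list to go through by hand. Nothing is copied automatically; deciding which of the DIFFERENT files to keep is still a manual job, for the reasons given in step 18.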