mdapoz@hybrid.uucp (Mark Dapoz) (05/09/90)
Here's yet another story of an obscure problem I just came across on my 3B1.

First the situation. When I arrived home from work today I noticed that the DTR light on the 'blazer was off and my 3B1 wasn't answering any incoming calls. Upon doing some initial investigation I found that I no longer had a uugetty process running on the modem port. This struck me as strange, since init is responsible for restarting the uugetty and there was no indication of problems such as "process respawning too quickly" (and I haven't touched uugetty or inittab in quite a while now). I tried sending init a signal to switch to run level 2 (which it should already be in) but that had no effect; it seemed init was somehow hung.

I was in a bit of a rush to get out tonight (a sure sign that something is going to go wrong :-) so I just decided to reboot the machine in the hopes that it would fix itself. Well, when I issued a shutdown command the system didn't even attempt to shut down, thus confirming my suspicion that init was dead. I did a manual shutdown and managed to bring the system down in an orderly fashion. However, upon rebooting, the system failed to initialise completely and it seemed to hang just after init was started. Wonderful, now I have a completely dead system instead of one with just a dead init.

Over the next 2.5 hours I tried every remedy I could think of to get it to boot up completely. I installed a new init from the distribution disks, recreated the inittab, went back to previous known working kernels, installed a backup copy of the shared library, ran every possible hardware test, started pulling cards, etc., but nothing worked. The system just kept stopping once init was started. If I removed init, then upon booting I would get a shell, so the kernel seemed to be ok; it's just that init was very sick. I ended up digging through the man pages for init to figure out exactly what it did upon boot (maybe I was missing something all these years).
The man page makes a reference to /etc/wtmp and /etc/utmp as files used by init to log information. I then checked these files and found, to my surprise, a file called /etc/utmp.lck! Ah ha, a lock file! It seems init at some point created a lock file for utmp and never removed it. It also seems that init isn't quite bright enough to know that it should remove this file upon boot, so it just happily sat there waiting for it to disappear. Of course, once I removed the lock file my system booted quite happily.

Now, why would init ever create a lck file, and why doesn't it know to remove it when the system boots up? It was quite frustrating to spend 3 hours on a floppy based unix digging around to find this. I hope this experience may help someone else if they ever have the misfortune of getting into such a situation.

--
Managing a software development team    |  Mark Dapoz
is a lot like being on the psychiatric  |  mdapoz%hybrid@cs.toronto.edu
ward.  -Mitch Kapor, San Jose Mercury   |  ...uunet!mnetor!hybrid!mdapoz
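
For anyone wanting to check for the same condition, here is a minimal sketch of the check described above: look in a directory for leftover lock files such as utmp.lck. It is written in Python purely for illustration and is not something from the original system. The function name find_stale_locks, the ".lck" suffix default, and the age threshold are all assumptions chosen for the example; the post only establishes that a file named /etc/utmp.lck was left behind. The sketch only lists candidates and deliberately removes nothing.

```python
import os
import time


def find_stale_locks(directory, suffix=".lck", min_age_seconds=0.0, now=None):
    """List regular files in `directory` whose names end in `suffix`
    and whose modification time is at least `min_age_seconds` old.

    Hypothetical helper: the suffix and the age test are assumptions,
    not documented behaviour of init.
    """
    if now is None:
        now = time.time()
    stale = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Skip anything that is not a plain file with the lock suffix.
        if not name.endswith(suffix) or not os.path.isfile(path):
            continue
        age = now - os.path.getmtime(path)
        if age >= min_age_seconds:
            stale.append(path)
    return sorted(stale)
```

Calling find_stale_locks("/etc") would report a leftover /etc/utmp.lck if one exists. Whether a given lock file is really stale still has to be judged by hand, for example by checking that it predates the last boot, before it is removed.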