mikulska@odin.ucsd.edu (Margaret Mikulska) (01/31/91)
Somebody with no net access asked me for help with Apollos gone berserk
after a power failure. I'm not familiar with their setup, but if anything
here rings a bell, I'd appreciate a hint.

Basically, they have two servers and some diskless nodes. After the power
failure, the sys admin can't log in as root or do 'su' (getting 'login
incorrect' or 'sorry'). However, everybody can log in as an ordinary user.
Also, the ownership of files owned by root is shown by root's ID, i.e., 0,
and not by its name 'root'. Some files 'can not be found'.

I know this is a preposterously inadequate description, but that's what
I have. The sys admin is lost and desperate. I myself know more about
other UNIXes than about Apollos. Perhaps some Apollo guru among you
recognizes what's going on even from this description.

It seems rather obvious to me that some corruption of the file system
has occurred - perhaps of the rgy files? One can always go back
to the distribution tapes and backups, and restore everything from scratch,
but perhaps this is overkill.

Somebody suggested to me that the system might have reset the root passwd
to the default password. I had no chance to check whether this helps, but
would this account for file ownership showing as belonging to "0" instead of
"root"?

Any help/hint/advice appreciated. Thanks in advance.

Margaret Mikulska
UC San Diego
mmikulska@ucsd.edu
krowitz@RICHTER.MIT.EDU (David Krowitz) (01/31/91)
Sounds like perhaps the registry daemons are either dead or
non-communicative due to NCS problems. I believe that the machines
all keep caches of recently used user IDs in case they get cut
off from the rgy, which may explain why some users can get logged
in, but root (who presumably isn't logged in very often) can't.

Check the rgyd replicas, the glbd and llbd replicas, and the
time-of-day clocks (they need to be within a few minutes of
each other on the nodes running glbd's). Try using "cat" on
/etc/passwd -- if you can't list the file, then the registries
are dead, since this file is really a type-manager file which
invokes the rgyd to create a listing on the fly.

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
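The "cat /etc/passwd" test above can be wrapped in a small script. This is only a sketch: check_passwd is a made-up helper name, and the extra look for a root entry is an assumption that a healthy registry listing includes the root account.

```shell
#!/bin/sh
# Sketch of the registry check suggested above. On these nodes /etc/passwd
# is generated on demand by the registry, so an unreadable file suggests the
# registry is dead or unreachable.
# check_passwd is a hypothetical helper name, not an Apollo tool.

check_passwd() {
    file=${1:-/etc/passwd}

    # Step 1: can the file be listed at all?
    if cat "$file" >/dev/null 2>&1; then
        :
    else
        echo "cannot read $file: registry may be dead or unreachable"
        return 1
    fi

    # Step 2 (assumption): a healthy listing should contain a root entry.
    if grep '^root:' "$file" >/dev/null 2>&1; then
        echo "$file is readable and lists root"
        return 0
    fi

    echo "$file is readable but has no root entry"
    return 2
}
```

Run it as "check_passwd" with no argument to test /etc/passwd itself. A return status of 1 points at the registry daemons or NCS; a status of 2 points at a registry that answers but has lost the root account.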
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (02/01/91)
In article <16199@sdcc6.ucsd.edu> mikulska@odin.ucsd.edu (Margaret Mikulska) writes:
>Basically, they have two servers and some diskless nodes. After the power
>failure, the sys admin can't log in as root or do 'su' (getting 'login
>incorrect' or 'sorry'). However, everybody can log in as an ordinary user.
>Also, the ownership of files owned by root is shown by root's ID, i.e., 0,
>and not by its name 'root'. Some files 'can not be found'.
>
>Somebody suggested to me that the system might have reset the root passwd
>to the default password. I had no chance to check whether this helps, but
>would this account for file ownership showing as belonging to "0" instead of
>"root"?

Having no password would not cause this, but having no root account
in the registry would -- are you running a registry on this node, or at
least somewhere in the network? If not, you should be; if so, your node
is not making a connection to it. You should be able to 'cat' the
/etc/passwd file and see all your accounts (root will be first).

I am also assuming that a 'salvol' was done at reboot, either automatically
if rebooted in "Normal" mode (which is set by a switch somewhere on the
node), or manually if rebooted in "Service" mode. If the node is crashing
a lot, you can add

    #
    # Start the disk updating daemon.
    #
    (echo Starting /etc/update >/dev/console)
    if [ -f /etc/update ]; then
            /etc/update
    fi

to your /etc/rc file (just after "find orphans"), which has saved us
a lot of lost/damaged files.

If the lost files are critical, and it sounds like maybe some of them are,
you may have to reload the node.
--
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094
Fax: (416) 978-8775
thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (02/01/91)
(sorry for the bandwidth -- ucsd.edu didn't like me)

Margaret --

> Basically, they have two servers and some diskless nodes. After the power
> failure, the sys admin can't log in as root or do 'su' (getting 'login
> incorrect' or 'sorry'). However, everybody can log in as an ordinary user.
> Also, the ownership of files owned by root is shown by root's ID, i.e., 0,
> and not by its name 'root'. Some files 'can not be found'.

Sounds to me like the registry daemon isn't being started / starting up.
If, when 'normal users' log on, the DM (display mangler) output window says
something about "Unable to access Network Registry, using local registry"
then that's the case (or else networking is really confused). You will
sometimes get this "Unable to find..." message for the first bit, as the rgyd
warms up, and as the local node starts looking on the network.

> ...
> It seems rather obvious to me that some corruption of the file system
> has occurred - perhaps of the rgy files ? One can always go back
> to the distribution tapes and backups, and restore everything from scratch,
> but perhaps this is an overkill.

Unlike standard Unix, Apollos use a registry daemon to service registry
requests. The 'files' /etc/passwd, /etc/group, /etc/org, etc. are really
special objects that access the network registry daemons to get the
information. That way, you can have one central location for rgy info, so
it's always up-to-date. (In practice, you would normally set up a few 'slave'
rgyd processes, so that you have redundancy in case of crashes. The rgyd
processes exchange info with each other, and keep things up-to-date
internally.)

In addition to the rgyd process(es), you must have a couple of other
processes hanging around to get/keep things working. You must have a
local-location broker (llbd) on any node that offers NCS services (the rgyd
uses NCS, and therefore offers services). You must also have a
global-location broker (glbd) that maintains info from all the llbd processes.
This makes things circular, since the glbd processes offer NCS services too,
so they need to have an llbd running as well.

Here's the MINIMUM that you need to have running:
  On a master node (one of the 2 servers), run llbd, glbd, and rgyd.
Here's a BETTER setup (IMHO):
  On each server node, run llbd, glbd, and rgyd.

To start things up:
  llbd   started in /etc/rc -- command is '/etc/ncs/llbd'
  glbd   started in /etc/rc -- command is '/etc/ncs/glbd'
           to make the first one, '/etc/ncs/glbd -create -first'
           to make others, '/etc/ncs/glbd -create -from //node_with_glbd'
  rgyd   started in /etc/rc -- command is '/etc/rgyd'
           to make a replica (slave), '/etc/rgyd -create'
           to restart a broken one, '/etc/rgyd -recreate'

If the rgyd isn't starting up successfully, there may be problems, because
you need to become root in order to fix the thing that lets you become root.
He/she might want to try the default passwords, in the hope that it's hiding
in the local registry (a simple file lookup) w/ an old password.

If a glbd or an llbd isn't running, that fix is a lot easier. Just get on as
someone who can add objects to the /etc/daemons stub-file directory, and add
entries. (If a glbd has never been running, you're in the same boat as w/ the
rgyd -- you need to be root to create a glbd.) They might be able to put
commands into the /etc/rc file to create things, reboot, and then un-edit the
file back to its original state.

> Somebody suggested to me that the system might have reset the root passwd
> to the default password. I had no chance to check whether this helps, but
> would this account for file ownership showing as belonging to "0" instead of
> "root" ?

Probably not the case. It certainly wouldn't account for the '0' instead of
'root'.

> Any help/hint/advice appreciated. Thanks in advance.

Hope I helped, rather than babbled. You're welcome.

-- jt --
John Thompson
Honeywell, SSEC
Plymouth, MN 55441
thompson@pan.ssec.honeywell.com

As ever, my opinions do not necessarily agree with Honeywell's or reality's.
(Honeywell's do not necessarily agree with mine or reality's, either)
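
The daemon list in John's post can be sketched as an /etc/rc-style fragment, in the same style as the /etc/update test quoted earlier in the thread. The daemon paths are the ones given in the post; start_daemons and DRY_RUN are made-up names for illustration, and starting each daemon in the background is an assumption, since the thread does not say whether they detach on their own.

```shell
#!/bin/sh
# Sketch: start the location brokers before the registry daemon.
# start_daemons and DRY_RUN are hypothetical names, not Apollo conventions.

start_daemons() {
    for daemon in "$@"; do
        if [ -f "$daemon" ]; then
            echo "Starting $daemon"
            # With DRY_RUN=1, only report what would be started.
            if [ "${DRY_RUN:-0}" != 1 ]; then
                "$daemon" &    # backgrounding is an assumption
            fi
        else
            echo "Skipping $daemon (not found)"
        fi
    done
}

# Usage on a server node, llbd first because glbd and rgyd both use NCS:
#   start_daemons /etc/ncs/llbd /etc/ncs/glbd /etc/rgyd
```

Setting DRY_RUN=1 before the call prints what would be started without launching anything, which is a cheap way to spot a missing daemon binary after a crash.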
etb@milton.u.washington.edu (Eric Bushnell) (02/01/91)
I've had similar problems. The replies I've seen so far are
good ones, but if your nodes are *really* screwed, you may
need more drastic measures.

As others have explained, the problem is that some or all of
the ncs daemons are not running/communicating. The other
problem is that you have to be root, or look like root, to
fix some of these. (catch-22)

Apparently, your node can't find or use the local registry,
/sys/registry/rgy_local. If you can, log in as an ordinary user
and use rbak or /install/tools/rbak_sr10 to load in a local
registry from another node. (Make sure that one has a root
account in it.) Then you can get on as root and clean up any
corrupted files that may be causing your grief. You may have to
reboot and futz around a few times before you get the right
combination of fixes.

Here's another possibility: I have, on occasion, had to use
/install/tools/rgy_create to build a new registry from scratch,
then reboot the node and replace the registry with a backup copy.

I hope you don't have to resort to invol-ing the disk and
reloading everything.

Now lemme tellya 'bout the time the power failed whilst changing
the network number, and when they woke up, none of 'em knew where
they were...

Eric Bushnell
UW Civil Engineering
etb@milton.u.washington.edu
etb@augustus.ce.washington.edu