[comp.sys.apollo] problems with root access

mikulska@odin.ucsd.edu (Margaret Mikulska) (01/31/91)

Somebody with no net access asked me for help with apollos gone berserk
after a power failure. I'm not familiar with their setup, but if
anything here rings a bell, I'd appreciate a hint.

Basically, they have two servers and some diskless nodes. After the power
failure, the sys admin can't log in as root or do 'su' (getting 'login 
incorrect' or 'sorry'). However, everybody can log in as an ordinary user.
Also, the ownership of files owned by root is shown by root's ID, i.e., 0,
and not by its name 'root'. Some files 'can not be found'.

I know this is a preposterously inadequate description, but that's what
I have. The sys admin is lost and desperate. I myself know more about
other UNIXes than about apollos. Perhaps some apollo guru among you
recognizes what's going on even from this description.

It seems rather obvious to me that some corruption of the file system
has occurred - perhaps of the rgy files ? One can always go back
to the distribution tapes and backups, and restore everything from scratch,
but perhaps this is overkill.

Somebody suggested to me that the system might have reset the root passwd
to the default password. I had no chance to check if this helps, but would
this account for file ownership showing as belonging to "0" instead of
"root" ?

Any help/hint/advice appreciated. Thanks in advance.

Margaret Mikulska
UC San Diego

mmikulska@ucsd.edu

krowitz@RICHTER.MIT.EDU (David Krowitz) (01/31/91)

Sounds like perhaps the registry daemons are either dead or
non-communicative due to NCS problems. I believe that the
machines all keep caches of recently used user-ID's in case
they get cut off from the rgy, which may explain why some users
can get logged in, but root (who presumably isn't logged in
very often) can't. Check the rgyd replicas, the glbd and llbd
replicas, and the time-of-day clocks (they need to be within
a few minutes of each other on the nodes running glbd's). Try
using "cat" on /etc/passwd -- if you can't list the file, then
the registries are dead, since this file is really a type-manager
file which invokes the rgyd to create a listing on the fly.
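
A minimal sketch of the "cat" test described above, for anyone who wants
to script it. It assumes the listing comes back in the usual
colon-separated passwd format with the UID in the third field; the
function name check_passwd_listing is made up for illustration and is
not an Apollo tool.

```shell
#!/bin/sh
# check_passwd_listing [FILE]
# Returns 0 if FILE can be listed and contains an account with UID 0,
# 1 if FILE cannot be listed at all, 2 if it lists but has no UID 0 entry.
# FILE defaults to /etc/passwd, which on the Apollo is produced by the rgyd.
check_passwd_listing() {
    pwfile=${1:-/etc/passwd}
    if ! listing=$(cat "$pwfile" 2>/dev/null); then
        echo "cannot list $pwfile: registry is probably unreachable"
        return 1
    fi
    # Field 3 is the UID (assumed standard passwd layout).
    if printf '%s\n' "$listing" |
        awk -F: '$3 == 0 { found = 1 } END { exit !found }'; then
        echo "$pwfile lists a UID 0 account"
        return 0
    fi
    echo "$pwfile is readable but has no UID 0 account"
    return 2
}
```

Run it as an ordinary user with no argument; a return status of 1 points
at dead or unreachable registries, while 2 suggests the registry answers
but has lost its root account.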


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (02/01/91)

In article <16199@sdcc6.ucsd.edu> mikulska@odin.ucsd.edu (Margaret Mikulska) writes:
>Basically, they have two servers and some diskless nodes. After the power
>failure, the sys admin can't log in as root or do 'su' (getting 'login 
>incorrect' or 'sorry'). However, everybody can log in as an ordinary user.
>Also, the ownership of files owned by root is shown by root's ID, i.e., 0,
>and not by its name 'root'. Some files 'can not be found'.
>
>Somebody suggested to me that the system might have reset the root passwd
>to the default password. I had no chance to check if this helps, but would
>this account for file ownership showing as belonging to "0" instead of
>"root" ?

Having no password would not cause this, but having no root account in
the registry would -- are you running a registry on this node, or at
least somewhere in the network? If not, you should be; if so, your node
is not making a connection to it. You should be able to 'cat' the
/etc/passwd file and see all your accounts (root will be first).
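
To illustrate why a missing root account shows up as "0" in file
listings: tools that print an owner have to map the numeric ID back to a
name, and fall back to the bare number when no entry matches. The sketch
below mimics that lookup against a passwd-format listing. The function
name owner_name and the assumption of a standard colon-separated layout
are mine, not anything taken from Domain/OS.

```shell
#!/bin/sh
# owner_name UID [FILE]
# Print the account name for UID from a passwd-format listing, or the
# bare number when the listing is unavailable or has no matching entry.
owner_name() {
    pwfile=${2:-/etc/passwd}
    if [ ! -r "$pwfile" ]; then
        echo "$1"        # listing unavailable: only the number is known
        return 0
    fi
    awk -F: -v uid="$1" '
        $3 == uid { print $1; found = 1; exit }
        END       { if (!found) print uid }
    ' "$pwfile"
}
```

With a healthy listing, "owner_name 0" prints root; if the listing
cannot be read or has no UID 0 entry, it prints 0, which matches the
symptom described.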

I am also assuming that a 'salvol' was done at reboot, either
automatically if rebooted in "Normal" mode (which is set by a switch
somewhere on the node), or manually if rebooted in "Service" mode.
If the node is crashing a lot, you can add

#
# Start the disk updating daemon.
#
(echo Starting /etc/update >/dev/console)
if [ -f /etc/update ]; then
	/etc/update
fi

to your /etc/rc file (just after "find orphans"), which has saved us a lot
of lost/damaged files. If the lost files are critical, and it sounds
like maybe some of them are, you may have to reload the node.
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775

thompson@PAN.SSEC.HONEYWELL.COM (John Thompson) (02/01/91)

(sorry for the bandwidth -- ucsd.edu didn't like me)

Margaret --

> Basically, they have two servers and some diskless nodes. After the power
> failure, the sys admin can't log in as root or do 'su' (getting 'login 
> incorrect' or 'sorry'). However, everybody can log in as an ordinary user.
> Also, the ownership of files owned by root is shown by root's ID, i.e., 0,
> and not by its name 'root'. Some files 'can not be found'.
Sounds to me like the registry daemon isn't being started / starting up.
If, when 'normal users' log on, the DM (display mangler) output window says
something about "Unable to access Network Registry, using local registry"
then that's the case (or else networking is really confused).  You will
sometimes get this "Unable to access..." message for the first bit, as the
rgyd warms up, and as the local node starts looking on the network.

> ...
> It seems rather obvious to me that some corruption of the file system
> has occurred - perhaps of the rgy files ? One can always go back
> to the distribution tapes and backups, and restore everything from scratch,
> but perhaps this is an overkill. 
Unlike standard Unix, Apollos use a registry daemon to service registry 
requests.  The 'files' /etc/passwd, /etc/group, /etc/org, etc are really
special objects that access the network registry daemons to get the information.
That way, you can have one central location for rgy info, so it's always 
up-to-date.  (In practice, you would normally set up a few 'slave' rgyd
processes, so that you have redundancy in case of crashes.  The rgyd processes
exchange info with each other, and keep things up-to-date internally.)
In addition to the rgyd process(es), you must have a couple other processes
hanging around to get/keep things working.  You must have a local-location
broker (llbd) on any node that offers NCS services (the rgyd uses NCS, and 
therefore offers services).  You must also have a global-location broker (glbd)
that maintains info from all the llbd processes.  This creates a circular
dependency: the glbd processes offer NCS services themselves, so they need an
llbd running too.

Here's the MINIMUM that you need to have running:
    On a master node (one of the 2 servers), run llbd, glbd, and rgyd.
Here's a BETTER setup (IMHO):
    On each server node, run llbd, glbd, and rgyd.
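
A rough way to check a node against the lists above is to scan a process
listing for the three daemon names. The sketch reads the listing on
standard input so that it makes no assumption about which ps options a
given node accepts; the function name missing_daemons is hypothetical,
and grep -w (whole-word match) is assumed to be available.

```shell
#!/bin/sh
# missing_daemons
# Read a process listing on stdin and print, one per line, each of
# llbd, glbd and rgyd that does not appear in it as a whole word.
missing_daemons() {
    listing=$(cat)
    for d in llbd glbd rgyd; do
        if ! printf '%s\n' "$listing" | grep -w "$d" >/dev/null; then
            echo "$d"
        fi
    done
}
```

Something like "ps -e | missing_daemons" (or whatever ps invocation
lists all processes on the node) should print nothing on a server set up
as described; any name it prints is a daemon to go and look for.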

To start things up:
    llbd      started in /etc/rc -- command is '/etc/ncs/llbd'
    glbd      started in /etc/rc -- command is '/etc/ncs/glbd'
              to make the first one, '/etc/ncs/glbd -create -first'
              to make others, '/etc/ncs/glbd -create -from //node_with_glbd'
    rgyd      started in /etc/rc -- command is '/etc/rgyd'
              to make a replica(slave), '/etc/rgyd -create'
              to restart a broken one,  '/etc/rgyd -recreate'
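
For illustration, the normal startup lines above could be wrapped in an
/etc/rc-style fragment like the following. This is only a sketch: the
helper names start_if_present and start_ncs_daemons are invented here,
running each daemon in the background with "&" is an assumption about
how they behave, and the llbd, glbd, rgyd order simply follows the
dependency described earlier (location brokers before the registry).

```shell
#!/bin/sh
# start_if_present PATH [ARGS...]
# Start the program in the background if the file exists; otherwise
# report that it was skipped and return 1.
start_if_present() {
    if [ -f "$1" ]; then
        echo "Starting $*"
        "$@" &            # assumption: the daemon should be backgrounded
    else
        echo "Skipping $1 (not found)"
        return 1
    fi
}

# Paths are the ones quoted in the list above.
start_ncs_daemons() {
    start_if_present /etc/ncs/llbd
    start_if_present /etc/ncs/glbd
    start_if_present /etc/rgyd
}
```

The -create and -recreate forms are one-time repairs and are
deliberately left out of this fragment.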

If the rgyd isn't starting up successfully, there may be problems, because
you need to become root in order to fix the thing that lets you become
root.  He/she might want to try the default passwords, on the hope that
it's hiding in the local-registry (a simple file lookup) w/ an old password.
If a glbd or an llbd isn't running, that fix is a lot easier.  Just get 
on as someone who can add objects to the /etc/daemons stub-file directory,
and add entries.  (If a glbd has never been running, you're in the same 
boat as w/ the rgyd -- you need to be root to create a glbd).  They might
be able to put commands into the /etc/rc file to create things, reboot,
and then un-edit the file back to its original state.

> Somebody suggested to me that the system might have reset the root passwd
> to the default password. I had no chance to check if this helps, but would
> this account for file ownership showing as belonging to "0" instead of
> "root" ?
Probably not the case.  It certainly wouldn't account for the '0' instead of
'root'.

> Any help/hint/advice appreciated. Thanks in advance.
Hope I helped, rather than babbled.  You're welcome.


-- jt --
John Thompson
Honeywell, SSEC
Plymouth, MN  55441
thompson@pan.ssec.honeywell.com

As ever, my opinions do not necessarily agree with Honeywell's or reality's.
(Honeywell's do not necessarily agree with mine or reality's, either)

etb@milton.u.washington.edu (Eric Bushnell) (02/01/91)

I've had similar problems. The replies I've seen
so far are good ones, but if your nodes are
*really* screwed, you may need more drastic measures.

As others have explained, the problem is that some or
all of the ncs daemons are not running/communicating.
The other problem is that you have to be root, or look
like root, to fix some of these. (catch-22)

Apparently, your node can't find or use the local registry,
/sys/registry/rgy_local. If you can, log in as user and
use rbak or /install/tools/rbak_sr10 to load in a local
registry from another node. (Make sure that one has a 
root account in it.)  Then you can get on as root and
clean up any corrupted files that may be causing your grief.
You may have to reboot and futz around a few times before
you get the right combination of fixes.

Here's another possibility: 
I have, on occasion, had to use /install/tools/rgy_create
to build a new registry from scratch, then reboot the node
and replace the registry with a backup copy.

I hope you don't have to resort to invol-ing the disk and 
reloading everything.

Now lemme tellya 'bout the time the power failed whilst changing
the network number, and when they woke up, none of 'em knew
where they were...

Eric Bushnell
UW Civil Engineering
etb@milton.u.washington.edu
etb@augustus.ce.washington.edu