[comp.sys.next] NetInfo goes on vacation

stefan@lbl-csam.arpa (Stefan gottschalk) (10/28/89)

After rebooting the NeXT, the system forgot about all the accounts, hosts,
file systems, and other whatnot contained in the NetInfo database.

Fortunately, the system resorted to the flat ascii files so I was able to
rig myself up an account and begin looking for the cause of the problem.
However, after a couple of days of reading about NetInfo, the booting
procedures, and the various daemons and services which must be in place, I
still haven't any idea as to what went wrong.

Here are some clues:

At the end of the execution of rc.boot, three error messages appear on the
screen in rapid succession.  An instant later the screen blacks out and the
login window appears, so I can't read them carefully, but I believe they are
as follows:

npd[108]: NetInfo Problem:  Communication failure
npd[108]: attempting to auto register printer
npd[108]: NetInfo Problem:  Communication failure

These messages also appear with various prefixes in the file
/private/adm/messages (or /usr/adm/messages, take your pick).

There is a 'core' file in the /etc/netinfo directory.  I'm told that dbx
can look at coredumps intelligently (if the failing program was compiled
with the '-g' option), but I'm not familiar with dbx or debugging with core
dumps, so I haven't tried that yet.

'ps -x' shows nibindd, the guy who creates the netinfod's, but the netinfo
daemons themselves aren't there. 

'niutil -list . /' produces the message "niutil: can't open .:/"
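
The checks above can be rolled into one small helper.  This is only a sketch:
the function name check_netinfo_procs is made up for illustration, and the
'ps' flags in the usage comment are an assumption that may need adjusting.
It reads a process listing on standard input, so it also works on saved
'ps' output.

```shell
# Report which NetInfo-related daemons appear in a process listing.
# Reads `ps` output on stdin (hypothetical helper, not a NeXT tool).
check_netinfo_procs() {
    listing=$(cat)
    for daemon in portmap nibindd netinfod; do
        # A plain substring match is enough for a quick look.
        if printf '%s\n' "$listing" | grep "$daemon" >/dev/null; then
            echo "$daemon: running"
        else
            echo "$daemon: MISSING"
        fi
    done
}

# Usage on a live system (flags are an assumption; adjust for your ps):
#   ps -ax | check_netinfo_procs
```

In the situation described here it would be expected to report portmap and
nibindd as running and netinfod as missing.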

The man page for nibindd says that it automatically brings up a netinfo
daemon for every *.nidb directory in /etc/netinfo.  So I killed the old
nibindd (with 'kill -QUIT' which seemed to work), and started up a new one.
Sure enough, 'ps -x' showed a '/usr/etc/netinfod local' whose parent
process was a nibindd that was itself a child of the nibindd I had
started.  But niutil still came back with the "can't open" message.
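
Since the man page ties one netinfod to each *.nidb directory, a quick way to
see which daemons ought to exist is to list those directories.  The sketch
below assumes only what the man page says; the function name list_nidb_tags
is made up for illustration.

```shell
# Print the tag of every *.nidb directory, i.e. the name that should show up
# as the argument of a running netinfod (e.g. "netinfod local").
list_nidb_tags() {
    dir=${1:-/etc/netinfo}
    for db in "$dir"/*.nidb; do
        # Skip the unexpanded pattern and any plain files (such as 'core').
        [ -d "$db" ] || continue
        basename "$db" .nidb
    done
}

# Usage:
#   list_nidb_tags /etc/netinfo
```

Each tag printed should have a matching netinfod in the 'ps' output; a tag
with no daemon points at the database that nibindd failed to serve.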

I figured maybe the local.nidb directory was messed up, so I tried copying
the local.nidb directory from the 'Software Release 1.0' optical over the
working version on the troubled NeXT.  That didn't help.  No change in
behavior.  That really worries me.  I figure that if it's not the database,
then it is the environment causing trouble, and if that's the case then I
imagine the problem could be anywhere!

Appendix A of the System Administration Guide, "System Initialization,"
p. A-18 states that NetInfo depends upon the portmap daemon.  Good ol'
'ps -x' shows that portmap is indeed present.

I checked the permissions on the '.nidb' files to make sure they were identical
to those on the optical.  They were.  Furthermore, everything I checked _is_
in the correct place in the file system.  The contents of '/etc/hostconfig',
which the rc and rc.local scripts source, are correct, compared to the 
version on the optical.  Note: I am certain that the optical disk has not been
altered.  The first thing we did on receipt of 1.0 was rebuild the SCSI from
the optical, which we then put into storage.
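
Checking files one at a time against the optical is easy to get wrong, so a
loop like the following may help.  It is a rough sketch: the function name
compare_to_reference and the mount point in the usage comment are
hypothetical, and it compares file contents only, not permissions.

```shell
# Compare every regular file under a reference tree with the same path
# under the live tree, reporting files that are missing or differ.
compare_to_reference() {
    live=$1
    ref=$2
    (cd "$ref" && find . -type f -print) | sort | while IFS= read -r f; do
        if [ ! -f "$live/$f" ]; then
            echo "missing: $f"
        elif ! cmp -s "$ref/$f" "$live/$f"; then
            echo "differs: $f"
        fi
    done
}

# Usage (the optical's mount point is an assumption; use absolute paths):
#   compare_to_reference /private/etc /optical/private/etc
```

Files that are expected to change after installation will show up as
differing too, so the output is a list of places to look, not a verdict.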

So there it is.  I may wind up simply rebuilding the disk after saving the
important files, but I'd like to know what happened so I can avoid making the
same mistake again.  Probably I zapped some file somewhere while romping around
as root, but I'll be damned if I know which one!

Please note that this is NOT a complaint (it's a plea for help).  The NeXT is
a little frustrating, at times, because of its differences from BSD 4.3. 
But it is these differences that make the NeXT such an interesting and
exciting machine!

Well, if anyone has any suggestions (any at all!) then please don't hesitate
to send me a note.  I'd appreciate it.

Thanks in advance,

		       Stefan Gottschalk (stefan@csam.lbl.gov)

phd_ivo@gsbacd.uchicago.edu (10/28/89)

In article <4038@helios.ee.lbl.gov>, stefan@lbl-csam.arpa (Stefan gottschalk) writes...
 
>  <variety of netinfo problems, basically dead netinfo>

I had the same problems once after fooling around with both netinfo and flat
files alternately. I ultimately resorted to rebuilding my hard-disk. At the
time, I posted a note to comp.sys.next which said that netinfo should have a
way of just saving everything, so that an administrator can play with netinfo
for a while, and if he/she screws up, simply restore the old settings and
reboot. I am now very careful about NetInfo, as I fear the symptoms could
return. Thus, if anyone knows how to fix this problem, I would also
appreciate hearing the answer.
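
In the absence of such a feature, a crude stand-in is to archive the whole
database directory before experimenting.  This is only a sketch under
assumptions: the function name snapshot_dir is made up, and copying the files
while netinfod has them open may capture an inconsistent state, so it is
probably safest with the daemons stopped.

```shell
# Archive a directory (e.g. /etc/netinfo) into a tar file so it can be
# restored later.  Paths in the archive are relative to the parent directory.
snapshot_dir() {
    src=$1
    out=$2
    (cd "$(dirname "$src")" && tar cf - "$(basename "$src")") > "$out"
}

# Usage (the destination path is only an example):
#   snapshot_dir /etc/netinfo /netinfo-backup.tar
```

Restoring would then mean unpacking the archive from the parent directory
with 'tar xf' and rebooting.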

/ivo