[comp.sys.apollo] YARR

holtz@zonker.cascade.carleton.ca (Neal Holtz) (10/09/90)

I am sure this question has been asked (and answered) before, but I can't find
it in my archives...

Configuration:
	1 - DN4500, disked
	3 - DN3500s, disked
	4 - DN2500s, diskless - one booted off each disked node
	SR10.2 Aegis & BSD
	1 Master registry, on DN4500, no replications
	DN4500 runs glbd, llbd, rgyd
	each DN3500 runs llbd

Problem:
	Over time, the registry server becomes unavailable on the DN3500s,
	and users are logged in using the local registries.  This seems
	to take the form of a gradual rot, with more and more of the DN3500s
	becoming 'serverless'.

	Also, of course, '/etc/passwd' is unreadable (doesn't exist).

	However, the registries are still available to the diskless
	DN2500s booted off the DN3500s, and /etc/passwd is OK.

	And the DN3500 has no trouble seeing the files on and otherwise
	communicating with the rgy server node.
                     

Attempted fixes:

	We have rebooted everything, manually restarted various
	servers, and looked at a few log files in /usr/adm.
	Nothing worked, and there were no clues, either.  Debugging via
	Apollo's hotline seems to take a long time, as well.

	The clocks are set to within a few seconds.

Details:

Processes on the Master Registry node (DN4500):

    1 > ps -ax
      PID TTY     STAT  TIME COMMAND
        1 ?       S <   0:23 /etc/init
        2 ?       R   1461:10 null
        3 ?       S     0:51 purifier
        4 ?       S     0:29 purifier
        5 ?       S     0:50 unwired_dxm
        6 ?       S     0:00 pinger
        7 ?       S     0:04 netreceive
        8 ?       S     0:39 netpaging
        9 ?       S     0:18 wired_dxm
       10 ?       S     0:38 netrequest
       91 ?       S     5:10 /etc/tcpd
       96 ?       S     1:25 /etc/routed -f -q
       99 ?       S     0:00 /etc/inetd
      102 ?       S     0:00 /etc/ncs/llbd
      104 ?       S     1:49 /etc/ncs/glbd
      107 ?       S <   2:02 /etc/rgyd
      112 ?       S     0:04 /sys/spm/spm
      115 ?       S     0:03 /sys/net/netman
      117 ?       S     0:03 /sys/ns/ns_helper
      120 ?       S     0:08 /sys/alarm/alarm_server -disk 98 -msg -w 0 0 550 100 -
      122 ?       S     0:01 /sys/mbx/mbx_helper
      125 ?       S <   0:08 /etc/Xapollo -K /usr/X11/lib/keyboard/keyboard.config 
      127 ?       S <   8:29 dm
	

processes on the serverless DN3500:

    Connected to node 19A9D   "//thorin"
    login: 
    Password: 
    Using local registry. Can't use network registry:  - Registry server unavailable (from RGYC / Server)
    1 > ps -ax
      PID TTY     STAT  TIME COMMAND
        1 ?       S <   0:28 /etc/init
        2 ?       R   158:32 null
        3 ?       S     0:05 purifier
        4 ?       S     0:00 purifier
        5 ?       S     0:09 unwired_dxm
        6 ?       S     0:00 pinger
        7 ?       S     0:00 netreceive
        8 ?       S     0:23 netpaging
        9 ?       S     0:02 wired_dxm
       10 ?       S     0:13 netrequest
       92 ?       S     0:00 /etc/ncs/llbd
       97 ?       S     0:01 /sys/spm/spm
       99 ?       S     0:03 /sys/net/netman
      101 ?       S     0:02 /sys/alarm/alarm_server -disk 98 -msg -w 0 0 550 100 -v 20 20
      105 ?       S     0:00 /sys/mbx/mbx_helper
      107 ?       S <   0:05 /etc/Xapollo -K /usr/X11/lib/keyboard/keyboard.config -D1 s+r-
      109 ?       S <   4:09 dm
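
To make the two listings above easier to compare, a small filter along
these lines could be run on each node.  This is only a sketch: the name
daemon_report is made up, the list of daemons is simply the ones mentioned
in this post, and it assumes a POSIX shell with awk, which may need
adjusting for the shells shipped with SR10.2.

```shell
#!/bin/sh
# daemon_report is a hypothetical helper, not an Apollo-supplied tool.
# It reads `ps -ax` output on stdin and reports whether tcpd, llbd, glbd
# and rgyd appear anywhere in the listing.
daemon_report() {
    awk '
        {
            # The STAT column may be one or two fields ("S" or "S <"),
            # so scan every field and keep the last path component.
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, "/")
                seen[parts[n]] = 1
            }
        }
        END {
            split("tcpd llbd glbd rgyd", want, " ")
            for (j = 1; j <= 4; j++)
                printf "%s: %s\n", want[j], ((want[j] in seen) ? "running" : "absent")
        }
    '
}
```

Usage would be `ps -ax | daemon_report` on the master registry node and
again on a serverless DN3500.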


Perhaps I'll dig out my SR8 floppies and re-install :-(
--
Prof. Neal Holtz,  Dept. of Civil Eng.,  Carleton University,  Ottawa, Canada
Internet: holtz@civeng.carleton.ca   Tel: (613)788-5797    Fax: (613)788-3951

goldfish@CONCOUR.CS.CONCORDIA.CA (-- Paul Goldsmith) (10/10/90)

Your problem is probably related to interaction between the registry
daemons and "tcpd".

See your "ncs" manual (a thin thing you probably won't even remember
seeing) and look up the part on selecting TCP versus the internal
Apollo carrier protocol.  If "tcpd" is up when the glbd & llbd start,
they use TCP as a carrier.  THIS DOESN'T WORK VERY WELL.

The text below should offer some suggestions.  Run the same "grep" on
your system and compare the output.  I have several extra lines.
Check the manual, complain like hell on the hotline, and make sure you
understand the changes before going at it.

concour,goldfish 215 grep lbd /etc/rc*
/etc/rc:# llbd must fork and the parent exit before other NCS servers may
/etc/rc:# be run.  Do not use "/etc/ncs/llbd &"
/etc/rc:# that the llbd will listen to.  e.g.,
/etc/rc:#   /etc/ncs/llbd -li dds
/etc/rc:# will limit the llbd to listen on the dds protocol family - forcing it to
/etc/rc:if [ -f /etc/ncs/llbd -a -f /etc/daemons/llbd ]; then
/etc/rc:	(echo " llbd\c" >/dev/console)
/etc/rc:#	/etc/ncs/llbd 
/etc/rc:	/etc/ncs/llbd -li dds
/etc/rc:# /etc/ncs/llbd is needed to run the glbd.
/etc/rc:# that the glbd will listen to.  e.g.,
/etc/rc:#   /etc/ncs/glbd -li ip
/etc/rc:# will limit the llbd to listen on the ip protocol family - forcing it to
/etc/rc:if [ -f /etc/ncs/glbd -a -f /etc/daemons/glbd -a $LLBD_ENABLED = true ]; then
/etc/rc:	(echo " glbd\c" >/dev/console)
/etc/rc:#	/etc/ncs/glbd &
/etc/rc:	/etc/ncs/glbd -li dds &
/etc/rc:# determines which nodes these are).  rgyd requires that /etc/ncs/llbd 
/etc/rc:# /etc/ncs/llbd is needed to run lpd.
concour,goldfish 216 
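
If the raw grep output is hard to read, something like the following
might help pick out just the lines that actually start the daemons.  It
is a rough sketch, not anything Apollo supplies: check_lbd_lines is a
made-up name, and it assumes a POSIX shell with grep and sed.  The only
option it looks for is the "-li" shown in the rc comments above.

```shell
#!/bin/sh
# check_lbd_lines is a hypothetical helper name.
# It prints the active (uncommented) llbd/glbd start lines from an rc file
# and marks whether each one carries the "-li" option.
check_lbd_lines() {
    rcfile=${1:-/etc/rc}
    # Match only lines that begin (after optional whitespace) with the
    # daemon path; this skips "#" comments and the "if [ -f ..." tests.
    grep '^[[:space:]]*/etc/ncs/[gl]lbd' "$rcfile" |
    sed 's/^[[:space:]]*//' |
    while IFS= read -r line; do
        case $line in
            *" -li "*) printf '%s\n' "restricted:   $line" ;;
            *)         printf '%s\n' "unrestricted: $line" ;;
        esac
    done
}
```

Run as `check_lbd_lines /etc/rc`.  Any line reported as "unrestricted"
starts the daemon without limiting the protocol family it listens on.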


-- Paul Goldsmith
    (goldfish)        (514) 848-3031         <goldfish@concour.cs.concordia.ca>
         (Shirley Maclaine told me there would be LIFETIMES like this)