[comp.sys.apollo] RGYD at 10.2

root@VLSI-MENTOR.JPL.NASA.GOV (The vlsi-mentor Super User) (07/20/90)

WHY is my rgyd taking up 95% of the cpu time at sr10.2?
Apparently, the rgyd has now become the thing's null job.
HALP!

----
Dave Hayes  dave@vlsi-mentor.jpl.nasa.gov   dave%vlsi-mentor@jpl-mil.jpl.nasa.gov
"The word 'choice' is a fraud when one is taught what to choose."

root@VLSI-MENTOR.JPL.NASA.GOV (The vlsi-mentor Super User) (07/21/90)

>When /etc/rgyd goes berserk, it is frequently caused by a failure in the underlying
>NCS support (/etc/ncs/llbd and/or /etc/ncs/glbd), which in turn relies on the TCP/IP
>services to be up and running. Yes, I know that NCS is *supposed* to be able to use

Yup. I was running a 10.1 glbd and a 10.2 llbd. Silly me! 

Thanks to all who responded (George Zipperlin and David Krowitz) for the help. 
It works now. I still have a question:

>If TCP/IP services are working correctly, then another possible cause is a global
>location broker database problem. If you run more than a single copy of /etc/ncs/glbd
>on your network (and it is *highly* recommended that you do so) and the system clocks
>on the nodes running each of the copies are not within 5 minutes of each other, then
>changes to the glbd database made on one machine may not get propagated to the other
>machines. You can use /etc/ncs/drm_admin to check this and to forcibly merge the db
>contents.

Why is it recommended to run more than one copy of glbd? When is HPOLLO going
to figure out a way to sync their clocks?
----
Dave Hayes  dave@vlsi-mentor.jpl.nasa.gov   dave%vlsi-mentor@jpl-mil.jpl.nasa.gov
"The word 'choice' is a fraud when one is taught what to choose."

krowitz@RICHTER.MIT.EDU (David Krowitz) (07/21/90)

There are a couple of reasons to run more than one copy of /etc/ncs/glbd (this
applies to /etc/rgyd, too). The first is for overall network reliability. If
you only have one copy of the NCS server running and that node goes down for
any reason, then *no* applications which use NCS services can find each other
unless they just happen by accident to be running on the same node. If you have
more than one copy of the global server, then the local broker on your node
can usually find one of the alternate global brokers. The second reason is
for distributing the workload. This is only really needed if you have a lot
of NCS applications running on your net (either a lot of nodes each running
a few applications or a few nodes running a lot of applications). Note that
login/logout (which use rgyd, which in turn uses NCS), printing via either
prf or lpr, and debugging with DDE are all common activities which use NCS
services. So do ftp and telnet (login/logout), any Unix program which reads
/etc/passwd or /etc/group (which are special objects whose type managers call
rgyd to extract the registry info), etc.
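
To illustrate the reliability argument above, here is a minimal Python sketch
of a client that tries a list of broker replicas in turn. It is not NCS code:
the names find_broker and probe and the host names are hypothetical, and the
real llbd/glbd lookup protocol is not shown.

```python
from typing import Callable, Iterable, Optional


def find_broker(replicas: Iterable[str],
                probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first replica that answers, or None if all are down.

    `probe` is a hypothetical stand-in for whatever check a client would
    make against a broker; it must return True when the host responds.
    """
    for host in replicas:
        try:
            if probe(host):
                return host
        except OSError:
            # Treat a network error as "this replica is unavailable"
            # and keep trying the remaining ones.
            continue
    return None


if __name__ == "__main__":
    up = {"node_b"}  # assumption: pretend only node_b is reachable
    print(find_broker(["node_a", "node_b"], lambda h: h in up))
    print(find_broker(["node_a"], lambda h: h in up))
```

With a single replica in the list, one dead node means the lookup returns
nothing; with two or more, the lookup only fails when every replica is down.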

The big problem is that since NCS services and registry services are now
part of the low-level system services they *MUST* be extremely robust or
the entire system fails ... and they aren't all that reliable. 

As for the clocks ... this is another reason why I hate Unix ... reliable
system operation requires yet another service which is not provided by
the OS. /etc/timed can allegedly be used to keep the clocks consistent,
but it's just another server which I've got to configure and run on every
single node in the network, and of course it's built on top of TCP services
so that it fails when TCP fails -- and TCP services can be made to fail
in so many ways which are completely unrelated to an actual network
failure (i.e., cabling, networking card, host down, network jammed, etc.)
that it's frightening. By contrast, DDS services only fail when the network
fails, and they usually recover when the network recovers. TCP-based
services generally require that the individual servers be killed and
restarted.
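
The five-minute clock tolerance quoted earlier in this thread can be checked
with something like the Python sketch below. It assumes you have already
collected one clock reading per glbd node (as seconds since some common epoch)
by whatever means you have; the function name skewed_pairs and the node names
are hypothetical, and nothing here talks to glbd or drm_admin.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MAX_SKEW_SECONDS = 5 * 60  # the five-minute tolerance quoted in the thread


def skewed_pairs(clocks: Dict[str, float],
                 max_skew: float = MAX_SKEW_SECONDS
                 ) -> List[Tuple[str, str, float]]:
    """Return (node, node, skew) for every pair further apart than max_skew."""
    bad = []
    # Compare every pair of nodes once, in name order for stable output.
    for (a, ta), (b, tb) in combinations(sorted(clocks.items()), 2):
        skew = abs(ta - tb)
        if skew > max_skew:
            bad.append((a, b, skew))
    return bad


if __name__ == "__main__":
    # Assumption: readings were taken at (nearly) the same moment.
    readings = {"node_a": 1000.0, "node_b": 1100.0, "node_c": 1500.0}
    for a, b, skew in skewed_pairs(readings):
        print(f"{a} and {b} differ by {skew:.0f} seconds")
```

Any pair it reports is a candidate for the propagation problem described
above.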


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)