root@VLSI-MENTOR.JPL.NASA.GOV (The vlsi-mentor Super User) (07/20/90)
WHY is my rgyd taking up 95% of the cpu time at sr10.2? Apparently, the rgyd has now become the thing's null job. HALP!

----
Dave Hayes
dave@vlsi-mentor.jpl.nasa.gov
dave%vlsi-mentor@jpl-mil.jpl.nasa.gov
"The word 'choice' is a fraud when one is taught what to choose."
root@VLSI-MENTOR.JPL.NASA.GOV (The vlsi-mentor Super User) (07/21/90)
>When /etc/rgyd goes berserk, it is frequently caused by a failure in the underlying
>NCS support (/etc/ncs/llbd and/or /etc/ncs/glbd), which in turn relies on the TCP/IP
>services to be up and running. Yes, I know that NCS is *supposed* to be able to use

Yup. I was running a 10.1 glbd and a 10.2 llbd. Silly me! Thanks to all who responded (George Zipperlin and David Krowitz) for the help. It works now.

I still have a question:

>If TCP/IP services are working correctly, then another possible cause is a global
>location broker database problem. If you run more than a single copy of /etc/ncs/glbd
>on your network (and it is *highly* recommended that you do so) and the system clocks
>on the nodes running each of the copies are not within 5 minutes of each other, then
>changes to the glbd database made on one machine may not get propagated to the other
>machines. You can use /etc/ncs/drm_admin to check this and to forcibly merge the db
>contents.

Why is it recommended to run more than one copy of glbd? When is HPOLLO going to figure out a way to sync their clocks?

----
Dave Hayes
dave@vlsi-mentor.jpl.nasa.gov
dave%vlsi-mentor@jpl-mil.jpl.nasa.gov
"The word 'choice' is a fraud when one is taught what to choose."
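As a rough illustration of the five-minute rule quoted above, here is a minimal Python sketch that flags pairs of nodes whose clocks are too far apart. It assumes you have already collected each node's clock reading (in seconds since the epoch) by some means of your own; the node names and the `skewed_pairs` helper are hypothetical, and nothing here calls drm_admin or any other NCS tool.

```python
from itertools import combinations

MAX_SKEW_SECONDS = 5 * 60  # the "within 5 minutes" tolerance from the quoted text


def skewed_pairs(clock_readings, max_skew=MAX_SKEW_SECONDS):
    """Return (node_a, node_b, skew) for each pair of nodes whose clocks
    differ by more than max_skew seconds.

    clock_readings maps a node name to that node's clock in seconds since
    the epoch, all sampled at roughly the same moment.
    """
    bad = []
    # Compare every pair once, in a stable (sorted by name) order.
    for (a, ta), (b, tb) in combinations(sorted(clock_readings.items()), 2):
        skew = abs(ta - tb)
        if skew > max_skew:
            bad.append((a, b, skew))
    return bad


if __name__ == "__main__":
    # Hypothetical readings for three nodes that each run a glbd replica.
    readings = {"node_a": 648518400, "node_b": 648518520, "node_c": 648518800}
    for a, b, skew in skewed_pairs(readings):
        print(f"{a} and {b} differ by {skew} s")
```

Every pair is compared, not just each node against a single reference, because the quoted rule is about the replicas being within 5 minutes of each other.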
krowitz@RICHTER.MIT.EDU (David Krowitz) (07/21/90)
There are a couple of reasons to run more than one copy of /etc/ncs/glbd (this applies to /etc/rgyd, too).

The first is overall network reliability. If you have only one copy of the NCS server running and that node goes down for any reason, then *no* applications which use NCS services can find each other unless they just happen to be running on the same node. If you have more than one copy of the global server, then the local broker on your node can usually find one of the alternate global brokers.

The second reason is distributing the workload. This is only really needed if you have a lot of NCS applications running on your net (either a lot of nodes each running a few applications, or a few nodes running a lot of applications). Note that login/logout (which use rgyd, which in turn uses NCS), printing via either prf or lpr, and debugging with DDE are all common activities which use NCS services. So do ftp and telnet (login/logout), any Unix program which reads /etc/passwd or /etc/group (which are special objects whose type managers call rgyd to extract the registry info), etc. The big problem is that since NCS services and registry services are now part of the low-level system services, they *MUST* be extremely robust or the entire system fails ... and they aren't all that reliable.

As for the clocks ... this is another reason why I hate Unix ... reliable system operation requires yet another service which is not provided by the OS. /etc/timed can allegedly be used to keep the clocks consistent, but it's just another server which I've got to configure and run on every single node in the network, and of course it's built on top of TCP services so that it fails when TCP fails -- and TCP services can be made to fail in so many ways which are completely unrelated to an actual network failure (i.e. cabling, network card, host down, network jammed, etc.) that it's frightening.
By contrast, DDS services only fail when the network fails, and they usually recover when the network recovers. TCP-based services generally require that the individual servers be killed and restarted.

-- David Krowitz
krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
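The reliability argument for running more than one global broker can be sketched as a toy model in Python. This is only an illustration of the reasoning, not how llbd actually locates a glbd; the `first_reachable` helper, the node names, and the reachability check are all hypothetical.

```python
def first_reachable(replicas, is_reachable):
    """Return the first replica for which is_reachable(replica) is true,
    or None if every replica is down."""
    for node in replicas:
        if is_reachable(node):
            return node
    return None


if __name__ == "__main__":
    # Pretend node_a has gone down and only node_b and node_c are up.
    up = {"node_b", "node_c"}

    # One copy of the global broker: lookups fail while node_a is down.
    print(first_reachable(["node_a"], up.__contains__))

    # Two copies: the lookup falls through to the surviving replica.
    print(first_reachable(["node_a", "node_b"], up.__contains__))
```

With a single replica the lookup has nowhere to fall back to, which is the "no applications can find each other" case described in the post.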