agq@itd0.dsto.oz (Ashley Quick) (08/04/90)
A week or two ago, I posted an item about problems creating remote (shell)
processes on other nodes.

The problem: on node A, try to create a process on B:

	crp -on B -me

which would fail with an error similar to

	"cannot create remote process - no rights to server mailbox"

(I cannot remember the exact message, but this is the gist of it.)

The solution: check your /dev directory and see what the crp devices look
like. In our case they were all sorts of different sizes. Get your system
administrator to re-create the crp devices with:

	mkdev /dev crp

and you will then be able to log in remotely.

(Interesting note: when created, the crp devices have a type of 'mbx'.
After they have been used by a remote login, the type changes to 'spmio'.
Curious, eh?)

The configuration is a DN4500 running SR10.1, though the same problem has
been seen on a diskless DN3500, also running SR10.1. This may not be a
problem under later SRs.... but what causes the devices to get screwed up
in the first place?

Ashleigh Quick
AGQ@dstos3.dsto.oz.au
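
For reference, the check-and-repair sequence described above, written out
as commands (a sketch only, assuming the SR10 Unix environment; 'B' is the
example node name used above, and mkdev may live under /etc on some
systems):

	# look at the crp devices - sizes and protections should all match
	ls -l /dev/crp*

	# re-create them (needs root, or your system administrator)
	mkdev /dev crp

	# then try creating a remote process again from another node
	crp -on B -me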
bep@quintro.uucp (Bryan Province) (08/07/90)
In article <1147@fang.dsto.oz> agq@dstos3.dsto.oz (Ashleigh Quick) writes:
>	crp -on B -me
>
>which would fail with an error similar to
>	"cannot create remote process - no rights to server mailbox"
>
>.... but what causes the devices to get
>screwed up in the first place?
>
>Ashleigh Quick
>AGQ@dstos3.dsto.oz.au

The only explanation I have gotten is that it happens when you are logged
in as root and do a "crp -on B -me" to another machine. Take a look at the
protections on the /dev/crp* files on that machine. They should be wide
open, but after root gets hold of them they get set so that only root can
do anything with them. The strange thing is that root can't crp on either
after this has happened.

Instead of running /etc/mkdev you could also just change the rights on the
/dev/crp* files. The way to avoid the problem is to avoid doing a
"crp -on B -me" while logged in as root. If you do, check the protections
on the crp files of machine B after disconnecting.

--
--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
Bryan Province                           Glenayre Corp.
quintro!bep@lll-winken.llnl.gov          Quincy, IL
tiamat!quintro!bep@uunet
"Surf Kansas, There's no place like home, Dude."
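
A quick way to check and loosen those protections (a sketch, assuming the
SR10 Unix environment; the mode shown is just "wide open" read/write, and
your site's Domain ACLs may also need attention):

	# on machine B, inspect the crp device protections
	ls -l /dev/crp*

	# if a root crp has tightened them, open them up again
	chmod 666 /dev/crp*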
jonathan@jarthur.Claremont.EDU (Jonathan Ball) (08/08/90)
In article <1147@fang.dsto.oz> agq@dstos3.dsto.oz (Ashleigh Quick) writes:
>Get your system administrator to re-create the crp devices with:
>	mkdev /dev crp
>
>but what causes the devices to get
>screwed up in the first place?

One bug with SR10.1 is that if any option other than -me is used, such as
-login username (or even if -me is accidentally omitted), the crp device
will be corrupted. It is a known bug and has, I believe, been corrected in
later versions of Domain/OS.

--
jonathan@jarthur.claremont.edu  (134.173.4.42)
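
To make the distinction concrete, two illustrative invocations (the node
and user names are made up; -me and -login are the options named above):

	# does not trip the SR10.1 bug described above
	crp -on B -me

	# reported above to corrupt the crp device on node B under SR10.1
	crp -on B -login fred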
agq@itd0.dsto.oz (Ashley Quick) (08/10/90)
Thank you to all those who posted or E-mailed me about the create process
problem. We have re-created the /dev/crp devices, and things have got
better. In future we will check the ACLs. However, things are not
completely cured. We have a MUCH BIGGER problem, and the CRP problem is
but a small symptom of the bigger one!!! See my next posting for details -
it includes some private E-Mail, which I am quoting - the more help the
better.....

Interesting point about crp devices..... Only /dev/crp00 ever seems to get
used.... I tried CRPing onto a node about 20 times from a variety of
sources.... there were lots of crpxx things in /tmp, BUT only /dev/crp00
was locked (found by doing llkob | grep '/dev' and llkob | grep '/tmp').
Also - there are 16 crp devices, but you can crp onto a node more than 16
times...... Curious - but as it seems to work I won't question why/how.

Ashleigh Quick
AGQ@dstos3.dsto.oz.au
agq@itd0.dsto.oz (Ashley Quick) (08/10/90)
My previous posting mentioned a big BIG problem. I have exchanged some
E-Mail with others about this, and am posting to get wide coverage.

We have a node here running SR10.1. When we run three print servers,
strange things begin to happen... like "prf -list_pr" failing with a
message "unable to locate printers for site xxxxx - unable to bind
socket". When it is sick, the /etc/ncs/lb_admin utility will not talk to
the local location broker (llbd), you cannot CRP off or onto the node,
etc. Killing one of the print servers makes things a little better. Then
you may be able to CRP onto the node once or twice... after that CRP will
just die (if coming in from elsewhere), and trying to CRP out of the sick
node bombs with an error similar to the one above.

Here is an edited version of what I have sent to Dave Krowitz, which
contains an edited version of some of his earlier suggestions....

Msg> Recently you mailed me with some info about our weird and wonderful
Msg> problem of 'no more sockets' (also known as 'can't bind socket').
Msg>
Msg> You sent:
Msg>
Msg> >
Msg> >My guess is that the problem is not with your pty's. NCS is a method of
Msg> [... more on ptys]
Msg> >
Msg> >If /etc/ping, ftp, telnet, rlogin, etc. work between the nodes in question,
Msg> >then your TCP services are probably OK. /etc/ping will tell you that *some*
Msg> [etc]
Msg>
Msg> TCP services are working OK. We run tcpd and inetd on every node in our
Msg> network. One central group does the administration, OS build/install,
Msg> etc. They are as fooled by this problem as anybody.
Msg>
Msg> >If your TCP services seem ok, then start checking your llbd's on the nodes
Msg> >in question, and the glbd's on all nodes in your network which run the global
Msg> >broker. /etc/ncs/llbd_admin and /etc/ncs/drm_admin are the tools to use for
Msg> >this. drm_admin will tell you if the global databases are out of synch and
Msg> >if the clocks on the nodes are different. Run it on each node in question
Msg> >and see that the list of glbd sites is the same on each node! (some nodes
Msg> >may only see a subset of all the glbd's that are supposed to be running).
Msg> >
Msg>
Msg> OK.
Msg> On our net we have 3 glbd's running. I have checked them. They all
Msg> know about each other, on the right nodes. The clocks are in sync to
Msg> within about 30 seconds. [Our sys admin people complained bitterly
Msg> about the crummy hardware which lets the clocks slip - when system
Msg> software depends on them being accurate.]
Msg>
Msg> I ran /etc/ncs/lb_admin on each of the nodes, and cleaned up the glb
Msg> and llb databases. (Some of which did contain old/inaccurate
Msg> garbage.)
Msg>
Msg> The problem, after all of this, has not gone away. It only seems to
Msg> happen [be most apparent] when I have 3 prsvrs running.
Msg>
Msg> To recap: a DN4500, running SR10.1. We run three print servers (one is
Msg> a LaserJet with my own driver [which is incomplete - but works well
Msg> enough], another is a line printer [via a National Instruments GPIB
Msg> port!!!!], and the third is an HP7550 plotter). This gives service for
Msg> a number of applications, including Mentor Graphics, and simulation
Msg> tools from Eesof. This node also runs the print manager (for this
Msg> "site"), as well as tcpd, inetd, llbd, glbd, spm, etc....
Msg>
Msg> The problem is not always apparent. When it is there, I have noticed
Msg> the following:
Msg>
Msg> prf -list_printers
Msg>     This will list each of the print manager sites in the
Msg>     network, with a message saying something like 'unable to
Msg>     locate printers for site xxxxxx - unable to bind socket'. (Or
Msg>     was it '... - no more free sockets'?)
Msg>
Msg> /etc/ncs/lb_admin
Msg>     When the problem is apparent, this will NOT COMMUNICATE at
Msg>     all with the local location broker. (i.e. cannot lookup, clean,
Msg>     etc.)
Msg>
Msg> crp -on fred -me
Msg>     Doing this from the problem node fails with the same message
Msg>     about no more sockets.
Msg>
Msg> All I do is kill any one of the three print servers for things to
Msg> get better. So far, our sys admins say that's what we should do. (Not
Msg> a solution to the problem, though.)
Msg>
Msg> When the node in question is 'sick', other nodes can use prf -list_pr
Msg> and see the printers which the sick node's print manager is managing.
Msg>
Msg> SOMETIMES taking the sick node down to the phase II shell and coming
Msg> up again will cure it. For a while.
Msg>
Msg> When the node is not sick, it will eventually become sick. No operator
Msg> intervention is required to bring on a bout of sickness!!!
Msg>
Msg>
Msg> Questions:
Msg>
Msg> Is there a limit in DOMAIN/OS on the number of print servers that can
Msg> be run on a node? (And if so, WHY????)
Msg>
Msg> Is there a limit on the number of 'sockets' available for NCS type
Msg> services? (Again - if so why?) If there is a limit - can it be
Msg> configured in any way??????
Msg>
Msg> Has anybody else seen this? Should I report it as an APR or am I doing
Msg> something really stupid?
Msg>
Msg> This seems to indicate a fairly major problem in NCS - as if something
Msg> somewhere is using resources (sockets?) and not freeing them
Msg> afterwards. (Or maybe the old un-initialised variable trick?!)
Msg>
Msg> Maybe it gets cured in later releases? (I wait for the day we go up to
Msg> SR10.2 - only our Mentor stuff is holding us back.)

[end of E-mail message]

Since sending this off, I have done some more investigating. I started
with the sick node, and from another node tried to CRP onto node 'sick'
(not its real name - but I may as well protect the innocent[?!]). I looked
at how many remote processes could log in, and found that as I killed
processes on node 'sick', I could create more remote processes before
things died (i.e. went sick). As it looks a lot like 'crp' uses NCS
services, this seems fair enough.

Then I killed off the 'netman' process. (The diskless node boot server, I
think.) Bingo. All came good. But after a re-start [=> phase 2 and back
again], things are their normal sick selves. (Netman is still there.)

It appears to me that there is some kind of limitation brought about by
NCS services just running out. Also, don't blame my own home-grown servers
- I killed them off and things can still get sick! (I also do not believe
that server processes which open mailboxes and wait on event counts can
really make things misbehave so badly - although each does acquire a
device - but that would just be too silly...)

So, does anybody have any suggestions / comments?
See questions above.
Will SR10.2 fix this?

Yours in frustration

Ashleigh Quick
AGQ@dstos3.dsto.oz.au
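
For anyone who wants to repeat the broker checks discussed above, the
rough sequence is (a sketch only - the tools and the 'lookup'/'clean'
subcommands are the ones named in this thread, and 'sick' is the
placeholder node name used above):

	# check that the replicated global location brokers agree with
	# each other and that the node clocks are in sync
	/etc/ncs/drm_admin

	# inspect and clean out stale entries in the llb/glb databases,
	# using its interactive 'lookup' and 'clean' commands
	/etc/ncs/lb_admin

	# then see whether remote process creation works again
	crp -on sick -me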
kerr@tron.UUCP (Dave Kerr) (08/10/90)
In article <1174@fang.dsto.oz> agq@dstos3.dsto.oz (Ashleigh Quick) writes:
>
> [ description of problems with crp deleted ]
>
>Msg> Questions:
>Msg>
>Msg> Is there a limit in DOMAIN/OS on the number of print servers that can
>Msg> be run on a node? (And if so, WHY????)

There may be a limit, but it's much greater than 3. In the July '90 patch
tape release notes there's a patch (160) that fixes a bug in prflib when
you have more than 132 print servers registered with the glb. Three should
be no problem.

>Msg> Is there a limit on the number of 'sockets' available for NCS type
>Msg> services? (Again - if so why?) If there is a limit - can it be
>Msg> configured in any way??????

Don't know.

>Msg> Has anybody else seen this? Should I report it as an APR or am I doing
>Msg> something really stupid?

I think you should contact Apollo if you haven't already done so, then if
you're not satisfied with their response submit an APR.

>Msg> Maybe it gets cured in later releases? (I wait for the day we go up to
>Msg> SR10.2 - only our Mentor stuff is holding us back.)

You might be interested in something that appeared in a Mentor newsletter
I recently received:

  "In September, in response to strong customer demand, Mentor Graphics
  will ship a software update to our 7.0 release. The key benefit of this
  release is that it will be running on Apollo's 10.3 Operating System."

>So, does anybody have any suggestions / comments?
>See questions above.

For what it's worth, we have had problems with /bin/ksh and crp. There's a
patch on the July tape to fix this for SR10.1 (patch 164). The problem is
that the shell hangs when you do a crp to another node.

Dave
--
-- Dave Kerr  (301) 765-4453  (WIN) 765-4453
tron::kerr              Internal WEC vax mail
kerr@tron.bwi.wec.com   from an Internet site
kerr@tron.UUCP          from a smart uucp mailer
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (08/10/90)
In article <631@tron.UUCP> kerr@tron.bwi.wec.com (Dave Kerr) writes:
>>Msg> Is there a limit on the number of 'sockets' available for NCS type
>>Msg> services? (Again - if so why?) If there is a limit - can it be
>>Msg> configured in any way??????
>
>Don't know.

The limit on both NCS and TCP/IP sockets was too low in SR10.0/SR10.1
(I believe it was 23 in each case), especially on DN10K systems or if you
run a lot of UNIX daemons, some of which use both NCS and TCP/IP (e.g.
lpd). This was fixed in SR10.2 and SR10.2.p.
--
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094   Fax: (416) 978-8775