[comp.sys.apollo] create remote process

agq@itd0.dsto.oz (Ashley Quick) (08/04/90)

A week or two ago, I posted an item about problems creating remote
(shell) processes on other nodes.

The problem:

on node A, try to create process on B:
   crp -on B -me

which would fail with an error similar to
  "cannot create remote process - no rights to server mailbox"
(I cannot remember the exact message, but this is the gist of it.)



The solution:

Check your /dev directory and see what the crp devices look like.
In our case they were all sorts of different sizes.

Get your system administrator to re-create the crp devices with:
  mkdev /dev crp

And you will then be able to log in remotely.

(Interesting note: when created, the crp devices have a type of 'mbx'.
After they have been used by a remote login, the type changes to 'spmio'.
Curious eh?)
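
For anyone wanting to check their own node, roughly what we did - a
sketch, assuming one of the SR10 Unix environments is installed (the
Aegis equivalents are left as an exercise):

  ls -l /dev/crp*       # sizes and protections should all look alike
  su                    # mkdev needs to run as root
  /etc/mkdev /dev crp   # re-create the crp devices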

The configuration is a DN4500 running SR10.1, though the same problem
has been seen on a diskless DN3500, also running SR10.1. This may not
be a problem under later SR's.... but what causes the devices to get
screwed up in the first place?

Ashleigh Quick
AGQ@dstos3.dsto.oz.au

bep@quintro.uucp (Bryan Province) (08/07/90)

In article <1147@fang.dsto.oz> agq@dstos3.dsto.oz (Ashleigh Quick) writes:
>   crp -on B -me
>
>which would fail with an error similar to
>  "cannot create rmeote process - no rights to server mailbox"
>
>.... but what causes the devices to get
>screwed up in the first place?
>
>Ashleigh Quick
>AGQ@dstos3.dsto.oz.au

The only explanation I have gotten is that it happens when you are logged in
as root and do a "crp -on B -me" to another machine.  Take a look at the
protections on the /dev/crp* files on that machine.  They should be wide open,
but after root gets hold of them they get set so that only root can do
anything with them.  The strange thing is that root can't crp onto the node
either after this has happened.  Instead of running /etc/mkdev you could also
just change the rights of the /dev/crp* files.  The way to avoid the problem
is to avoid doing a "crp -on B -me" while logged in as root.  If you do, check
the protections of the crp files on machine B after disconnecting.
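
Concretely, on machine B that looks something like this (a sketch; I'm
assuming the BSD environment, and mode 666 is just my reading of "wide
open"):

  ls -l /dev/crp*       # after a root crp these show root-only modes
  chmod 666 /dev/crp*   # as root, open them back up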
-- 
--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
Bryan Province     Glenayre Corp.           quintro!bep@lll-winken.llnl.gov
                   Quincy,  IL              tiamat!quintro!bep@uunet
           "Surf Kansas, There's no place like home, Dude."

jonathan@jarthur.Claremont.EDU (Jonathan Ball) (08/08/90)

In article <1147@fang.dsto.oz> agq@dstos3.dsto.oz (Ashleigh Quick) writes:

>Get your system administrator to re-create the crp devices with:
>  mkdev /dev crp
>
>but what causes the devices to get
>screwed up in the first place?

One bug with SR10.1 is that if any option other than the -me option is used,
such as -login username (or even if -me is accidentally omitted), the crp
device will be corrupted.  It is a known bug and has, I believe, been
corrected in later versions of Domain/OS.
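
In other words, under SR10.1 any of these variants can leave the device
corrupted ('B' and 'someuser' are stand-in names):

  crp -on B -login someuser   # -login instead of -me
  crp -on B                   # -me accidentally omitted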


-- 
jonathan@jarthur.claremont.edu (134.173.4.42)

agq@itd0.dsto.oz (Ashley Quick) (08/10/90)

Thank you to all those who posted or E-mailed me about the create process
problem.

We have re-created the /dev/crp devices, and things have got better.
In future we will check the ACLs.

However, things are not completely cured. We have a MUCH BIGGER problem,
and the CRP problem is but a small symptom of the bigger one!!!

See my next posting for details - it includes some private E-Mail -
the more help the better.....





Interesting point about crp devices..... Only /dev/crp00 ever seems to
get used....


(I tried CRPing onto a node about 20 times from a variety of sources....
 lots of crpxx things in /tmp, BUT only /dev/crp00 was locked....
 (checked by doing llkob | grep '/dev' and llkob | grep '/tmp').
 Also - there are 16 crp devices, but you can crp onto a node more
 than 16 times......
 Curious - but as it seems to work I won't question why/how.)
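
For anyone wanting to repeat the experiment, it was roughly this
('target' is a stand-in node name):

  crp -on target -me     # from several other nodes, ~20 times over
  llkob | grep '/dev'    # then on 'target' itself: only /dev/crp00 locked
  llkob | grep '/tmp'    # ...yet there is one crpxx object per login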



Ashleigh Quick
AGQ@dstos3.dsto.oz.au

agq@itd0.dsto.oz (Ashley Quick) (08/10/90)

My previous posting mentioned a big BIG problem.

I have exchanged some E-Mail with others about this, and am posting to
get a wide coverage.

We have a node here running SR10.1. When we run three print servers,
strange things begin to happen... like "prf -list_pr" will fail with a
message "unable to locate printers for site xxxxx - unable to bind
socket". When it is sick the /etc/ncs/lb_admin utility will not talk
to the local location broker (llbd), you cannot CRP off or on the
node, etc.

Killing one of the print servers makes things a little better. Then
you may be able to CRP onto the node once or twice... after that CRP
will just die (if coming in from elsewhere), and trying to CRP out of
the sick node bombs with a similar error to the one above.



Here is an edited version of what I sent to Dave Krowitz, which
quotes some of his earlier suggestions....


Msg> Recently you mailed me with some info about our weird and wonderful
Msg> problem of 'no more sockets' (also known as 'can't bind socket').
Msg>
Msg> You sent:
Msg>
Msg> >
Msg> >My guess is that the problem is not with your pty's. NCS is a method of
Msg>         [... more on ptys]
Msg> >
Msg> >If /etc/ping, ftp, telnet, rlogin, etc. work between the nodes in question,
Msg> >then your TCP services are probably OK. /etc/ping will tell you that *some*
Msg>       [etc]
Msg>
Msg> TCP services are working OK. We run tcpd and inetd on every node in our
Msg> network. One central group do the administration, OS build/install, etc.
Msg> They are as fooled by this problem as anybody.
Msg>
Msg> >If your TCP services seem ok, then start checking your llbd's on the nodes
Msg> >in question, and the glbd's on all nodes in your network which run the global
Msg> >broker. /etc/ncs/lb_admin and /etc/ncs/drm_admin are the tools to use for
Msg> >this. drm_admin will tell you if the global databases are out of synch and
Msg> >if the clocks on the nodes are different. Run it on each node in question
Msg> >and see that the list of glbd sites is the same on each node! (some nodes
Msg> >may only see a subset of all the glbd's that are supposed to be running).
Msg> >
Msg>
Msg> OK.
Msg> On our net we have 3 glbd's running. I have checked them. They all
Msg> know about each other, on the right nodes. The clocks are in sync to
Msg> within  about 30 seconds. [Our sys admin people complained bitterly
Msg> about the crummy hardware which lets the clocks slip - when system
Msg> software depends on them being accurate.]
Msg>
Msg> I ran /etc/ncs/lb_admin on each of the nodes, and cleaned up the glb
Msg> and llb databases. (Some of them did contain old/inaccurate
Msg> garbage.)
Msg>
Msg> The problem, after all of this, has not gone away. It only seems to
Msg> happen [be most apparent] when I have 3 prsvrs running.
Msg>
Msg> To recap: a DN4500, running SR10.1. We run three print servers (one is
Msg> a LaserJet with my own driver [which is incomplete - but works
Msg> enough], another is a line printer [via a National Instruments GPIB
Msg> port!!!!], and the third drives an HP7550 plotter). These give service
Msg> to a number of applications, including Mentor Graphics, and simulation
Msg> tools from Eesof.  This node also runs the print manager (for this
Msg> "site"), as well as tcpd, inetd, llbd, glbd, spm, etc....
Msg>
Msg> The problem is not always apparent. When it is there, I have noticed
Msg> the following:
Msg>
Msg>     prf -list_printers
Msg>          This will list each of the print manager sites in the
Msg>          network, with a message saying something like 'unable to
Msg>          locate printers for site xxxxxx - unable to bind socket'. (Or
Msg>          was it '... - no more free sockets'?)
Msg>
Msg>     /etc/ncs/lb_admin
Msg>          When the problem is apparent, this will NOT COMMUNICATE at
Msg>          all with the local location broker (i.e. cannot lookup, clean,
Msg>          etc.).
Msg>
Msg>     crp -on fred -me
Msg>          Doing this from the problem node fails with the same message
Msg>          about no more sockets.
Msg>
Msg> All I have to do is kill any one of the three print servers for things
Msg> to get better. So far, our sys admins say that's what we should do. (Not
Msg> a solution to the problem, though.)
Msg>
Msg> When the node in question is 'sick', other nodes can use prf -list_pr
Msg> and see the printers which the sick node's print manager is managing.
Msg>
Msg> SOMETIMES taking the sick node down to the phase II shell and coming
Msg> up again will cure it. For a while.
Msg>
Msg> When the node is not sick, it will eventually become sick. No operator
Msg> intervention is required to bring on a bout of sickness!!!
Msg>
Msg>
Msg>
Msg> Questions:
Msg>
Msg> Is there a limit in DOMAIN/OS on the number of print servers that can
Msg> be run on a node? (And if so, WHY????)
Msg>
Msg> Is there a limit on the number of 'sockets' available for NCS type
Msg> services? (Again - if so why?) If there is a limit - can it be
Msg> configured in any way??????
Msg>
Msg> Has anybody else seen this? Should I report it as an APR or am I doing
Msg> something really stupid?
Msg>
Msg> This seems to indicate a fairly major problem in NCS - as if something
Msg> somewhere is using resources (sockets?), and not freeing them
Msg> afterwards. (Or maybe the old un-initialised variable trick?!).
Msg>
Msg> Maybe it gets cured in later releases? (I wait for the day we go up to
Msg> SR10.2 - only our Mentor stuff is holding us back).

    [end of E-mail message]
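
For the record, the glb/llb clean-up mentioned above was just
lb_admin's lookup and clean operations run on each node. A sketch from
memory - the exact sub-commands and prompt may differ:

  /etc/ncs/lb_admin
  lb_admin: lookup      # list what the broker thinks is registered
  lb_admin: clean       # probe each entry and drop the dead ones
  lb_admin: quit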

Since sending this off, I have done some more investigating. I started
with the sick node, and from another node, tried to CRP onto node
'sick' (not its real name - but I may as well protect the
innocent[?!]). I looked at how many remote processes could log in,
and found that as I killed processes on node 'sick', I could create
more remote processes before things died (i.e. went sick). As it looks a
lot like 'crp' uses NCS services, this seems fair enough.

Then, I killed off the process 'netman' (the diskless node boot server,
I think). Bingo. All came good. But after a re-start [=> phase II and
back again], things are their normal sick selves. (Netman is still
there.)
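
For the curious, the kill itself was just the usual process tools. A
sketch from memory (check 'help sigp' locally before trusting it):

  /com/pst           # list processes and spot netman
  /com/sigp netman   # signal it; the exact options needed may vary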

It appears to me that there is some kind of limitation brought about
by NCS services just running out. Also, don't blame my own home-grown
servers - I killed them off and things can still get sick! (I also do
not believe that server processes which open mailboxes and wait on
event counts can really make things misbehave so badly - although each
does acquire a device - but that would just be too silly...)

So, does anybody have any suggestions / comments?
See questions above.
Will SR10.2 fix this?


Yours in frustration

Ashleigh Quick
AGQ@dstos3.dsto.oz.au

kerr@tron.UUCP (Dave Kerr) (08/10/90)

In article <1174@fang.dsto.oz> agq@dstos3.dsto.oz (Ashleigh Quick) writes:
>
>
[ description of problems with crp deleted ]

>Msg> Questions:
>Msg>
>Msg> Is there a limit in DOMAIN/OS on the number of print servers that can
>Msg> be run on a node? (And if so, WHY????)

There may be a limit, but it's much greater than 3. In the
July '90 patch tape release notes there's a patch (160) that
fixes a bug in prflib when you have more than 132 print
servers registered with the glb. Three should be no problem.

>Msg> Is there a limit on the number of 'sockets' available for NCS type
>Msg> services? (Again - if so why?) If there is a limit - can it be
>Msg> configured in any way??????

Don't know.

>Msg> Has anybody else seen this? Should I report it as an APR or am I doing
>Msg> something really stupid?

I think you should contact Apollo if you haven't already
done so, then if you're not satisfied with their response
submit an APR.

>Msg> Maybe it gets cured in later releases? (I wait for the day we go up to
>Msg> SR10.2 - only our Mentor stuff is holding us back).

You might be interested in something that appeared in a
Mentor newsletter I recently received.

" In September, in response to strong customer demand,
Mentor Graphics will ship a software update to our 7.0
release. The key benefit of this release is that it will be
running on Apollo's 10.3 Operating System."


>So, does anybody have any suggestions / comments?
>See questions above.

For what it's worth, we have had problems with /bin/ksh and
crp. There's a patch on the July tape to fix this for SR10.1
(patch 164). The problem is that the shell will hang when
you do a crp to another node.

Dave
-- 
Dave Kerr (301) 765-4453 (WIN)765-4453
tron::kerr                 Internal WEC vax mail
kerr@tron.bwi.wec.com      from an Internet site
kerr@tron.UUCP             from a smart uucp mailer

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (08/10/90)

In article <631@tron.UUCP> kerr@tron.bwi.wec.com (Dave Kerr) writes:
>>Msg> Is there a limit on the number of 'sockets' available for NCS type
>>Msg> services? (Again - if so why?) If there is a limit - can it be
>>Msg> configured in any way??????
>
>Don't know.

The limit on both NCS and TCP/IP sockets was too low in SR10.0/SR10.1,
especially on DN10K systems or if you run a lot of UNIX daemons,
some of which use both NCS and TCP/IP (e.g. lpd); I believe the limit
was 23 in each case. This was fixed in SR10.2 and SR10.2.p.
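
A rough way to see how many are in use on a node (assuming the BSD
environment; the count is approximate since the listing includes
header lines and local sockets):

  netstat -a | wc -l    # compare against the ~23 ceiling above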
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775