[comp.sys.apollo] Apollo unkillable processes

anderson@atc.sps.mot.com (howard anderson) (09/19/90)

I really need help.  (Apollo Release 10.2 with patch m0118.)
I am having difficulty killing certain processes.  Sometimes the processes
can be killed easily.  Sometimes they can't.  Some random factor is
at work.  When whatever it is does go wrong the process cannot be killed.
Sigp -q doesn't work.  Sigp -s doesn't work.  Sigp -b works but sometimes
appears to destroy important parts of the operating system so that the node has
to be shut down.  Here is a traceback of one of these unkillable processes:

   $ tb -t b.report
   --------------------
   Task 514 "RPC reply acknowledger" (active) 
   In routine  system service "ec2_$wait"
   Called from "task_$sched" line 2702
   Called from "task_$ec2_wait" line 1411
   Called from "ec2_$wait_svc" line 164
   Called from "sleep" line 53
   Called from "periodic_task" line 1494
   Called from "task_$base_proc" line 874
   --------------------
   Task 0 "distinguished_task" (distinguished task) (ready) 
   In routine  "<UID 00000407.004A006A>" offset 7FFFF55C
   Called from "task_$ec2_wait" line 1411
   Called from "ec2_$wait_svc" line 164
   Called from "ec2_$wait" line 203
   Called from "task_$handle_mark_release" line 1942
   Called from "pm_$proc_release" line 2151
   Called from "pgm_$invoke_uid_pn" line 1160
   Called from "pm_$init" line 834
   --------------------
   Task 257 "reaper_task" (waiting) 
   In routine  "<UID 00000407.007600B2>" offset 7FFFEE48
   Called from "task_$ec2_wait" line 1411
   Called from "ec2_$wait_svc" line 164
   Called from "ec2_$wait" line 203
   Called from "task_$reaper_task" line 2248
   Called from "task_$base_proc" line 874
   

Now it looks to me like these are all Apollo routines and that the
user tasks have all been eliminated.  Apollo response center people
agreed that this was the case.  They said that their system routines
may be waiting for some resource that a third-party vendor didn't release.
Since all user code AND the third-party vendor code has been successfully
blown away at this point it looks like we will be waiting here a long time. 
(The Apollo response center is closing my call.  They told me to contact
the third-party vendor because it is obviously a problem in the third-party
vendor code.) 

Questions I have for you are these:

  1.  This situation runs counter to my philosophy regarding the
role of an operating system.  The user task has been eliminated
by the operating system.  So now we wait forever for an event 
that cannot happen?  I would not have expected the operating 
system to lose control in a case such as this.  Are my expectations
too high??

  2.  Has anyone else seen processes that cannot be killed with
a sigp -s??  Perhaps I am the only Apollo user with this problem.

  3.  Does anyone know a way to fake out the ec2_wait tasks and make
them think they got what they are looking for?  How much damage do you
think would result if one could do this?

  4.  Does anyone know what blasting processes such as these actually
does to the operating system??  The server_process_manager sometimes
exits.  Is this a possible effect of blasting a process such as the
above??

  5.  When processes such as the above that are using sio lines are blasted,
the sio lines are left "locked".  They cannot be unlocked since they are
not really "locked objects".  I found that copying /dev/sio1 to /dev/siox
then deleting /dev/sio1 then changing the name of /dev/siox to /dev/sio1
will restore /dev/sio1 to service.

  6.  If you are using DANFORD serial lines as well, they become "locked"
in a similar manner.  Copying them and changing their names does not work.
The ssiomonitor must be killed and restarted.  This means that all consoles
served by the ssiomonitor must be shut down and restarted in order to restart
one line!    

  7.  The group id of a forked child process is sometimes set to zero.
This seems to occur randomly about 20 percent of the time.
When the parent is killed, the child is not killed.  Has anyone
else seen this problem?  (This seems to be unrelated to the unkillable
process problem but perhaps in some way it IS related?)  
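
The item 5 workaround amounts to copy, delete, rename.  As a rough sketch
in shell, assuming the cp, rm and mv of the UNIX environments act on the
device object the way that sequence needs (on most other UNIX systems cp
would read from the device instead), and with restore_line and the "x"
suffix being made-up names:

```shell
# restore_line PATH: replace PATH with a fresh copy of itself.
# Hypothetical helper; mirrors the copy / delete / rename steps of item 5.
restore_line() {
    line=$1
    tmp="${line}x"        # made-up temporary name, e.g. /dev/sio1 -> /dev/sio1x
    [ -e "$line" ] || { echo "no such line: $line" >&2; return 1; }
    [ ! -e "$tmp" ] || { echo "temp name in use: $tmp" >&2; return 1; }
    cp "$line" "$tmp" || return 1   # 1. copy the line object
    rm "$line" || return 1          # 2. delete the "locked" original
    mv "$tmp" "$line"               # 3. give the copy the original name
}
```

Usage would be something like "restore_line /dev/sio1", run by a user
with rights on /dev.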

PLEASE HELP

zeleznik@cs.utah.edu (Mike Zeleznik) (09/19/90)

In article <2414@dover.sps.mot.com> anderson@atc.sps.mot.com (howard anderson) writes:
>
>I really need help.  (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes. ...
>Sigp -q doesn't work.  Sigp -s doesn't work.  Sigp -b works but ...

>  2.  Has anyone else seen processes that cannot be killed with
>a sigp -s??  Perhaps I am the only Apollo user with this problem.

I have been using Apollos for over 7 years now, and processes that have
to be blasted (sigp -b) have come to be (from our experience) a fact of
life.  Though it appears to occur less often these days.

>  4.  Does anyone know what blasting processes such as these actually
>does to the operating system?? 

I don't know the specifics, but Apollo used to clearly recommend that you
reboot the node as soon as possible after blasting a process, since it
left inconsistent state behind, which would cause problems down the
line.  I don't know if they still say this.  It used to be that after
enough blasts, the OS would start to do strange things (as you have seen),
and a reboot usually cleared it all up.

There was a point a few years ago when I used to simply reboot my
workstation each morning on general principle, just to avoid having to find
the problems the hard way.  Things seem to be much more robust these days
with 10.2.

Mike

  Michael Zeleznik              Computer Science Dept.
                                University of Utah
  zeleznik@cs.utah.edu          Salt Lake City, UT  84112
                                (801) 581-5617

roger@GW1.AGS.BNL.GOV (Roger A. Katz) (09/19/90)

In article <2414@dover.sps.mot.com> anderson@atc.sps.mot.com (howard anderson) writes:
>
>I really need help.  (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes. ...
>Sigp -q doesn't work.  Sigp -s doesn't work.  Sigp -b works but ...

Question: who is the 'owner' of these processes?
If the 'owner' is user.server, try the command

   /etc/server /com/sigp xxxxx -s

Also, I found that when logged in as root, I can sigp anything.

Email: roger@gw1.ags.bnl.gov                Roger A. Katz
                                            AGS Software Controls Group
                                            Brookhaven National Laboratory
                                            Upton, N.Y. 11973-5000
                                            (516) 282-2732
I'm sure, I may be wrong, but I'm sure.

krowitz@RICHTER.MIT.EDU (David Krowitz) (09/19/90)

If you cannot kill a process because of its ownership, you will get a
message at SR10 along the lines of "not owner". We, too,
have seen processes hang forever with a "sigp -s" failing to kill the
process. It frequently comes from a cleanup handler waiting for an
event count (file I/O completed, etc.) to be advanced. "sigp -b" is
almost always guaranteed to require a reboot sooner or later.


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

chen@digital.sps.mot.com (Jinfu Chen) (09/20/90)

>question: who is the 'owner' of these processes?
>try using the command
>/etc/server /com/sigp xxxxx -s
>if the 'owner' is user.server
>also i found that logged in as root, I can sigp anything.

Before SR10.2, anyone could sigp any process (except display_manager?), but
kill did check ownership. Sigp behaves better (in terms of ownership) under
SR10.2.

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (09/22/90)

This is a fairly common problem on our DN10020, and has been ever since
we got it in Jan/89. We have also seen this on M68K nodes, but not very
often since SR10.2 came out.

Our DN10000 is running SR10.2.0.5, and this problem seems to come and
go in severity with different OS patches, but never disappears.

None of 'kill', 'kill -9', or 'sigp -stop' has any effect on the
process, but 'sigp -blast' always works as of SR10.2.p (it didn't
before!). Of course, you had better shut down and reboot while you still
can once you blast anything.
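
Roughly, that escalation looks like this as a shell function.  It is only
a sketch using the standard kill command: try_kill and its grace-period
argument are made-up names, and it assumes you own the process (kill -s 0
also fails on "not owner").  It deliberately stops short of a blast and
just tells you when that is all that is left.

```shell
# try_kill PID [GRACE]: send TERM, then KILL, waiting GRACE seconds
# (default 2, an arbitrary choice) after each signal.
# Returns 0 if the process is gone, 1 if it survived both signals.
try_kill() {
    pid=$1
    grace=${2:-2}
    for sig in TERM KILL; do
        kill -s "$sig" "$pid" 2>/dev/null
        sleep "$grace"
        # Signal 0 delivers nothing; it only probes whether the pid exists.
        kill -s 0 "$pid" 2>/dev/null || return 0
    done
    echo "process $pid survived kill -9; a blast and reboot may be needed" >&2
    return 1
}
```

Something like "try_kill 1234 || echo time to plan a reboot" is how I
would expect it to be used.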

The current rumor is that this is a problem on dual cpu 10K's - is
anyone still seeing this on any other configuration? If you call this
problem into the HOTLINE, try mentioning call # AT002411 or A2037038,
which have been open for several months already, so the lucky person
at the Response Center can get up to speed right away :-).
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775

agq@itd1.dsto.oz (Ashley Quick) (09/25/90)

anderson@atc.sps.mot.com (howard anderson) writes:

>I really need help.  (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes.  Sometimes the processes
>can be killed easily.  Sometimes they can't.  Some random factor is
    [... things deleted]

>Now it looks to me like these are all Apollo routines and that the
>user tasks have all been eliminated.  Apollo response center people
>agreed that this was the case.  They said that their system routines

You have not said what your application program is. It looks a lot like
a print server to me. Knowing the application would help a lot.

>may be waiting for some resource that a third-party vendor didn't release.
>Since all user code AND the third party vendor code has been successfully
>blown away at this point it looks like we will be waiting here a long time.
>(The Apollo response center is closing my call.  They told me to contact
>the third-party vendor because it is obviously a problem in the third-party
>vendor code.)

Was the traceback taken AFTER you sigp'd the process? What flags
were used to sigp it?

>Questions I have for you are these:

>  1.  This situation runs counter to my philosophy regarding the
>role of an operating system.  The user task has been eliminated
>by the operating system.  So now we wait forever for an event
>that cannot happen?  I would not have expected the operating
>system to lose control in a case such as this.  Are my expectations
>too high??

DOMAIN/OS supports multiple threads (tasks) per process. The manuals make it very
clear that when this is used, special consideration needs to be given
to cleaning up the mess IF THE PROCESS IS aborted for any reason. (This
includes signalling it).

It would appear that the application developer has not followed the
guidelines.

>  2.  Has anyone else seen processes that cannot be killed with
>a sigp -s??  Perhaps I am the only Apollo user with this problem.

Yes. If your application is a print server, and it is waiting to
output via a SIO line, you can sigp it and nothing will happen UNTIL
the sio line lets the task become active. When the OS is waiting
for a sio line, nothing can stop the process. This is the only time
I have seen this. (And I have no great argument about it being
acceptable or not).

>  3.  Does anyone know a way to fake out the ec2_wait tasks and make
>them think they got what they are looking for?  How much damage do you
>think would result if one could do this?

No. It is perhaps possible IF you have the right knowledge and the
apollo version of the /usr/include/apollo files. It is not fixing the
problem, only the symptoms.

>  4.  Does anyone know what blasting processes such as these actually
>does to the operating system??  The server_process_manager sometimes
>exits.  Is this a possible effect of blasting a process such as the
>above??

Blasting is NASTY. It is still recommended that you shut down the node
after blasting.

>  5.  When processes such as the above that are using sio lines are blasted,
>the sio lines are left "locked".  They cannot be unlocked since they are
>not really "locked objects".  I found that copying /dev/sio1 to /dev/siox
>then deleting /dev/sio1 then changing the name of /dev/siox to /dev/sio1
>will restore /dev/sio1 to service.

That is news to me. I just used to shut down the node to the phase 2
shell and re-start.

>  6.  If you are using DANFORD serial lines as well, they become "locked"
>in a similar manner.  Copying them and changing their names does not work.
>The ssiomonitor must be killed and restarted.  This means that all consoles
>served by the ssiomonitor must be shutdown and restarted in order to restart
>one line!

This is not surprising as the manager probably works the same way as the
sio manager. You could also try signalling the ssiomonitor with a quit
fault, to get it to re-try. (The apollo siomonit handles quit faults
specially to get it to re-start the lines)

>  7.  The group id of a forked child process is sometimes set to zero.
>This seems to occur randomly about 20 percent of the time.
>When the parent is killed, the child is not killed.  Has anyone
>else seen this problem?  (This seems to be unrelated to the unkillable
>process problem but perhaps in some way it IS related?)

Is this the same context as above? i.e., is your non-stoppable application
doing this? IF SO, a possible cause is forking a multiple-threaded
process. As multiple threads (tasks) are a bit nasty, anything could happen!

>PLEASE HELP

Comments:

I need more information to make a more informed guess! The traceback
you included indicates that the process is definitely running with
tasks. It looks like a print server, but that is just a guess. It could
also be a program which uses NCS services.

TASKs are nasty things. When tasking is enabled in a process, the entire
process must be very carefully written. Some system services behave
differently, and some (esp. UNIX calls) should be avoided as they are not
re-entrant. Further, cleanup handlers should be used to allow the
tasks to be shut down cleanly. If tasks are not shut down cleanly, all
sorts of weird things can happen. Also, signalling a process has
a different behaviour when tasking is enabled, and the handling is
more complex. (You have no idea which task is active at the time
a signal might be received, so the handling is messy!)

If this is a third party product, I suggest you start writing to the
suppliers of it.

I have seen other postings which suggest re-booting the node every day,
etc. This should not be necessary, and we do not do that. The only time
I have had to do mass re-boots was during software development when things
went wrong. DOMAIN/OS is pretty stable. However, BLASTing processes is
not recommended by Apollo, so you must accept the consequences...

Ashleigh Quick
AGQ@dstos3.dsto.oz.au