anderson@atc.sps.mot.com (howard anderson) (09/19/90)
I really need help. (Apollo Release 10.2 with patch m0118.)

I am having difficulty killing certain processes. Sometimes the processes can be killed easily. Sometimes they can't. Some random factor is at work. When whatever it is does go wrong, the process cannot be killed. Sigp -q doesn't work. Sigp -s doesn't work. Sigp -b works but sometimes appears to destroy important parts of the operating system so that the node has to be shut down.

Here is a traceback of one of these unkillable processes:

$ tb -t b.report
--------------------
Task 514 "RPC reply acknowledger" (active)
In routine system service "ec2_$wait"
Called from "task_$sched" line 2702
Called from "task_$ec2_wait" line 1411
Called from "ec2_$wait_svc" line 164
Called from "sleep" line 53
Called from "periodic_task" line 1494
Called from "task_$base_proc" line 874
--------------------
Task 0 "distinguished_task" (distinguished task) (ready)
In routine "<UID 00000407.004A006A>" offset 7FFFF55C
Called from "task_$ec2_wait" line 1411
Called from "ec2_$wait_svc" line 164
Called from "ec2_$wait" line 203
Called from "task_$handle_mark_release" line 1942
Called from "pm_$proc_release" line 2151
Called from "pgm_$invoke_uid_pn" line 1160
Called from "pm_$init" line 834
--------------------
Task 257 "reaper_task" (waiting)
In routine "<UID 00000407.007600B2>" offset 7FFFEE48
Called from "task_$ec2_wait" line 1411
Called from "ec2_$wait_svc" line 164
Called from "ec2_$wait" line 203
Called from "task_$reaper_task" line 2248
Called from "task_$base_proc" line 874

Now it looks to me like these are all Apollo routines and that the user tasks have all been eliminated. Apollo response center people agreed that this was the case. They said that their system routines may be waiting for some resource that a third-party vendor didn't release. Since all user code AND the third-party vendor code has been successfully blown away at this point, it looks like we will be waiting here a long time.

(The Apollo response center is closing my call. They told me to contact the third-party vendor because it is obviously a problem in the third-party vendor code.)

Questions I have for you are these:

1. This situation runs counter to my philosophy regarding the role of an operating system. The user task has been eliminated by the operating system. So now we wait forever for an event that cannot happen? I would not have expected the operating system to lose control in a case such as this. Are my expectations too high??

2. Has anyone else seen processes that cannot be killed with a sigp -s?? Perhaps I am the only Apollo user with this problem.

3. Does anyone know a way to fake out the ec2_wait tasks and make them think they got what they are looking for? How much damage do you think would result if one could do this?

4. Does anyone know what blasting processes such as these actually does to the operating system?? The server_process_manager sometimes exits. Is this a possible effect of blasting a process such as the above??

5. When processes such as the above that are using sio lines are blasted, the sio lines are left "locked". They cannot be unlocked since they are not really "locked objects". I found that copying /dev/sio1 to /dev/siox, then deleting /dev/sio1, then renaming /dev/siox to /dev/sio1 will restore /dev/sio1 to service.

6. If you are using DANFORD serial lines as well, they become "locked" in a similar manner. Copying them and changing their names does not work. The ssiomonitor must be killed and restarted. This means that all consoles served by the ssiomonitor must be shut down and restarted in order to restart one line!

7. The group id of a forked child process is sometimes set to zero. This seems to occur randomly about 20 percent of the time. When the parent is killed, the child is not killed. Has anyone else seen this problem? (This seems to be unrelated to the unkillable process problem but perhaps in some way it IS related?)

PLEASE HELP
zeleznik@cs.utah.edu (Mike Zeleznik) (09/19/90)
In article <2414@dover.sps.mot.com> anderson@atc.sps.mot.com (howard anderson) writes:
>
>I really need help. (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes. ...
>Sigp -q doesn't work. Sigp -s doesn't work. Sigp -b works but ...

> 2. Has anyone else seen processes that cannot be killed with
>a sigp -s?? Perhaps I am the only Apollo user with this problem.

I have been using Apollos for over 7 years now, and processes that have to be blasted (sigp -b) have come to be (from our experience) a fact of life, though it appears to occur less often these days.

> 4. Does anyone know what blasting processes such as these actually
>does to the operating system??

I don't know the specifics, but Apollo used to clearly recommend that you reboot the node as soon as possible after blasting a process, since the blast left inconsistent state behind, which would cause problems down the line. I don't know if they still say this.

It used to be that after enough blasts, the OS would start to do strange things (as you have seen), and a reboot usually cleared it all up. There was a point a few years ago when I used to simply reboot my workstation each morning on general principle, just to avoid having to find the problems the hard way.

Things seem to be much more robust these days with 10.2.

Mike

Michael Zeleznik
Computer Science Dept.
University of Utah
Salt Lake City, UT 84112
zeleznik@cs.utah.edu
(801) 581-5617
roger@GW1.AGS.BNL.GOV (Roger A. Katz) (09/19/90)
In article <2414@dover.sps.mot.com> anderson@atc.sps.mot.com (howard anderson) writes:
>
>I really need help. (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes. ...
>Sigp -q doesn't work. Sigp -s doesn't work. Sigp -b works but ...

Question: who is the 'owner' of these processes? If the 'owner' is user.server, try using the command

/etc/server /com/sigp xxxxx -s

Also, I found that when logged in as root, I can sigp anything.

Email: roger@gw1.ags.bnl.gov
Roger A. Katz
AGS Software Controls Group
Brookhaven National Laboratory
Upton, N.Y. 11973-5000
(516) 282-2732

I'm sure, I may be wrong, but I'm sure.
krowitz@RICHTER.MIT.EDU (David Krowitz) (09/19/90)
If you cannot kill a process because of the ownership of the process, you will get a message at SR10 along the lines of "not owner". We, too, have seen processes hang forever with a "sigp -s" failing to kill the process. It frequently comes from a cleanup handler waiting for an event count (file I/O completed, etc.) to be advanced. "sigp -b" is almost always guaranteed to require a reboot sooner or later.

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter.mit.edu@eddie.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
chen@digital.sps.mot.com (Jinfu Chen) (09/20/90)
>question: who is the 'owner' of these processes?
>try using the command
>/etc/server /com/sigp xxxxx -s
>if the 'owner' is user.server
>also i found that logged in as root, I can sigp anything.

Before SR10.2, anyone could sigp any process (except display_manager?), whereas kill did check ownership. Sigp behaves better (in terms of ownership) under SR10.2.
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (09/22/90)
This is a fairly common problem on our DN10020, and has been ever since we got it in Jan/89. We have also seen this on M68K nodes, but not very often since SR10.2 came out.

Our DN10000 is running SR10.2.0.5, and this problem seems to come and go in severity with different OS patches, but never disappears. Neither 'kill', 'kill -9' nor 'sigp -stop' has any effect on the process, but 'sigp -blast' always works as of SR10.2.p (it didn't before!). Of course, you had better shut down and reboot while you still can once you blast anything.

The current rumor is that this is a problem on dual cpu 10K's - is anyone still seeing this on any other configuration?

If you call this problem into the HOTLINE, try mentioning call # AT002411 or A2037038, which have been open for several months already, so the lucky person at the Response Center can get up to speed right away :-).

--
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094 Fax: (416) 978-8775
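The escalation described in this post (a polite signal first, the forceful one only when that fails) can be sketched in Python for a generic POSIX system. This is an illustration under stated assumptions, not Domain/OS code: sigp and its -stop/-blast options have no equivalent here, SIGTERM and SIGKILL merely stand in for them, and the function names stop_with_escalation and spawn_child are hypothetical. A process stuck in an uninterruptible kernel wait would not respond to either signal, which is the situation the thread is complaining about.

```python
import signal
import subprocess
import sys


def spawn_child(ignore_term):
    """Start a throwaway Python child for demonstration purposes.

    If ignore_term is true, the child ignores SIGTERM, standing in for a
    process that does not react to a polite stop request.
    """
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN);" if ignore_term else ""
    code = "import signal, time;" + setup + "print('ready', flush=True); time.sleep(60)"
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has finished its setup
    return proc


def stop_with_escalation(proc, grace=2.0):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    Returns the name of the signal that was needed to end the process.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be caught or ignored by the child
        proc.wait()   # reap the child so it does not linger as a zombie
        return "SIGKILL"
```

Calling stop_with_escalation(spawn_child(ignore_term=True), grace=0.5) exercises the fallback path. Keeping the grace period explicit makes it visible how long the caller is prepared to wait before resorting to the forceful signal.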
agq@itd1.dsto.oz (Ashley Quick) (09/25/90)
anderson@atc.sps.mot.com (howard anderson) writes:

>I really need help. (Apollo Release 10.2 with patch m0118.)
>I am having difficulty killing certain processes. Sometimes the processes
>can be killed easily. Sometimes they can't. Some random factor is

[... things deleted]

>Now it looks to me like these are all Apollo routines and that the
>user tasks have all been eliminated. Apollo response center people
>agreed that this was the case. They said that their system routines

You have not said what your application program is. It looks a lot like a print server to me. Knowing the application would help a lot.

>may be waiting for some resource that a third-party vendor didn't release.
>Since all user code AND the third party vendor code has been successfully
>blown away at this point it looks like we will be waiting here a long time.
>(The Apollo response center is closing my call. They told me to contact
>the third-party vendor because it is obviously a problem in the third-party
>vendor code.)

Is the traceback done AFTER you have sigp'd the process? What flags were used to sigp it?

>Questions I have for you are these:
> 1. This situation runs counter to my philosophy regarding the
>role of an operating system. The user task has been eliminated
>by the operating system. So now we wait forever for an event
>that cannot happen? I would not have expected the operating
>system to lose control in a case such as this. Are my expectations
>too high??

DOMAIN/OS supports multiple threads (tasks) per process. The manuals make it very clear that when this is used, special consideration needs to be given to cleaning up the mess IF THE PROCESS IS aborted for any reason. (This includes signalling it.) It would appear that the application developer has not followed the guidelines.

> 2. Has anyone else seen processes that cannot be killed with
>a sigp -s?? Perhaps I am the only Apollo user with this problem.

Yes. If your application is a print server, and it is waiting to output via a SIO line, you can sigp it and nothing will happen UNTIL the sio line lets the task become active. When the OS is waiting for a sio line, nothing can stop the process. This is the only time I have seen this. (And I have no great argument about it being acceptable or not.)

> 3. Does anyone know a way to fake out the ec2_wait tasks and make
>them think they got what they are looking for? How much damage do you
>think would result if one could do this?

No. It is perhaps possible IF you have the right knowledge and the apollo version of the /usr/include/apollo files. It would not be fixing the problem, only the symptoms.

> 4. Does anyone know what blasting processes such as these actually
>does to the operating system?? The server_process_manager sometimes
>exits. Is this a possible effect of blasting a process such as the
>above??

Blasting is NASTY. It is still recommended that you shut down the node after blasting.

> 5. When processes such as the above that are using sio lines are blasted,
>the sio lines are left "locked". They cannot be unlocked since they are
>not really "locked objects". I found that copying /dev/sio1 to /dev/siox
>then deleting /dev/sio1 then changing the name of /dev/siox to /dev/sio1
>will restore /dev/sio1 to service.

That is news to me. I just used to shut down the node to the phase 2 shell and re-start.

> 6. If you are using DANFORD serial lines as well, they become "locked"
>in a similar manner. Copying them and changing their names does not work.
>The ssiomonitor must be killed and restarted. This means that all consoles
>served by the ssiomonitor must be shutdown and restarted in order to restart
>one line!

This is not surprising, as the manager probably works the same way as the sio manager. You could also try signalling the ssiomonitor with a quit fault, to get it to re-try. (The apollo siomonit handles quit faults specially to get it to re-start the lines.)

> 7. The group id of a forked child process is sometimes set to zero.
>This seems to occur randomly about 20 percent of the time.
>When the parent is killed, the child is not killed. Has anyone
>else seen this problem? (This seems to be unrelated to the unkillable
>process problem but perhaps in some way it IS related?)

Is this the same context as above? I.e., is your non-stoppable application doing this? IF SO, a possible cause is forking a multiple-threaded process. As multiple threads (tasks) are a bit nasty, anything could happen!

>PLEASE HELP

Comments:

I need more information to make a more informed guess! The traceback you included indicates that the process is definitely running with tasks. It looks like a print server, but that is just a guess. It could also be a program which uses NCS services.

TASKs are nasty things. When tasking is enabled in a process, the entire process must be very carefully written. Some system services behave differently, and some (esp. UNIX calls) should be avoided as they are not re-entrant. Further, cleanup handlers should be used to allow the tasks to be shut down cleanly. If tasks are not shut down cleanly, all sorts of weird things can happen.

Also, signalling a process has a different behaviour when tasking is enabled, and the handling is more complex. (You have no idea which task is active at the time a signal might be received, so the handling is messy!)

If this is a third party product, I suggest you start writing to the suppliers of it.

I have seen other postings which suggest re-booting the node every day, etc. This should not be necessary, and we do not do that. The only time I have had to do mass re-boots was during software development when things went wrong.

DOMAIN/OS is pretty stable. However, BLASTing processes is not recommended by Apollo, so you must accept the consequences...

Ashleigh Quick
AGQ@dstos3.dsto.oz.au
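The point about shutting tasks down cleanly can be illustrated with a small Python threading sketch. It is an analogy only and makes several assumptions: Python threads stand in for Domain/OS tasks, a threading.Event stands in for an event count, and the class TaskGroup and the task names are hypothetical. The idea is that each periodic task waits with a timeout rather than indefinitely, and runs its cleanup step however the loop ends, so a stop request never leaves a task waiting on something that will not arrive.

```python
import threading


class TaskGroup:
    """A set of periodic worker threads sharing one shutdown flag."""

    def __init__(self):
        self.stop = threading.Event()
        self.cleaned_up = []
        self._lock = threading.Lock()
        self._threads = []

    def _periodic_task(self, name, interval):
        try:
            # Event.wait returns False on timeout and True once stop is set,
            # so the task wakes regularly instead of blocking forever.
            while not self.stop.wait(interval):
                pass  # one unit of periodic work would go here
        finally:
            # Cleanup step: runs however the loop ends.
            with self._lock:
                self.cleaned_up.append(name)

    def start(self, names, interval=0.05):
        """Start one worker thread per name."""
        for name in names:
            thread = threading.Thread(target=self._periodic_task, args=(name, interval))
            thread.start()
            self._threads.append(thread)

    def request_stop(self, signum=None, frame=None):
        """Ask every task to finish; the signature also fits a signal handler."""
        self.stop.set()

    def join(self):
        """Wait until every task has run its cleanup and exited."""
        for thread in self._threads:
            thread.join()
```

In a real program, request_stop could be registered from the main thread with Python's signal.signal, which sidesteps the "which task receives the signal" question because CPython always runs signal handlers in the main thread. That property belongs to Python and says nothing about how Domain/OS delivers signals to tasks.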