anderson@atc.sps.mot.com (howard anderson) (09/18/90)
I really need help. (Apollo Release 10.2 with patch m0118.)

I am having difficulty killing certain processes. Sometimes the
processes can be killed easily. Sometimes they can't. Some random
factor is at work. When whatever it is does go wrong, the process
cannot be killed. sigp -q doesn't work. sigp -s doesn't work. sigp -b
works, but sometimes appears to destroy important parts of the
operating system so that the node has to be shut down.

Here is a traceback of one of these unkillable processes:

    $ tb -t b.report
    --------------------
    Task 514 "RPC reply acknowledger" (active)
    In routine system service "ec2_$wait"
    Called from "task_$sched" line 2702
    Called from "task_$ec2_wait" line 1411
    Called from "ec2_$wait_svc" line 164
    Called from "sleep" line 53
    Called from "periodic_task" line 1494
    Called from "task_$base_proc" line 874
    --------------------
    Task 0 "distinguished_task" (distinguished task) (ready)
    In routine "<UID 00000407.004A006A>" offset 7FFFF55C
    Called from "task_$ec2_wait" line 1411
    Called from "ec2_$wait_svc" line 164
    Called from "ec2_$wait" line 203
    Called from "task_$handle_mark_release" line 1942
    Called from "pm_$proc_release" line 2151
    Called from "pgm_$invoke_uid_pn" line 1160
    Called from "pm_$init" line 834
    --------------------
    Task 257 "reaper_task" (waiting)
    In routine "<UID 00000407.007600B2>" offset 7FFFEE48
    Called from "task_$ec2_wait" line 1411
    Called from "ec2_$wait_svc" line 164
    Called from "ec2_$wait" line 203
    Called from "task_$reaper_task" line 2248
    Called from "task_$base_proc" line 874

Now it looks to me like these are all Apollo routines and that the
user tasks have all been eliminated. Apollo response center people
agreed that this was the case. They said that their system routines
may be waiting for some resource that a third-party vendor didn't
release. Since all user code AND the third-party vendor code has been
successfully blown away at this point, it looks like we will be
waiting here a long time.
(The Apollo response center is closing my call. They told me to
contact the third-party vendor because it is obviously a problem in
the third-party vendor code.)

Questions I have for you are these:

1. This situation runs counter to my philosophy regarding the role of
   an operating system. The user task has been eliminated by the
   operating system. So now we wait forever for an event that cannot
   happen? I would not have expected the operating system to lose
   control in a case such as this. Are my expectations too high?

2. Has anyone else seen processes that cannot be killed with a
   sigp -s? Perhaps I am the only Apollo user with this problem.

3. Does anyone know a way to fake out the ec2_$wait tasks and make
   them think they got what they are looking for? How much damage do
   you think would result if one could do this?

4. Does anyone know what blasting processes such as these actually
   does to the operating system? The server_process_manager sometimes
   exits. Is this a possible effect of blasting a process such as the
   above?

5. When processes such as the above that are using sio lines are
   blasted, the sio lines are left "locked". They cannot be unlocked
   since they are not really "locked objects". I found that copying
   /dev/sio1 to /dev/siox, then deleting /dev/sio1, then renaming
   /dev/siox to /dev/sio1 will restore /dev/sio1 to service.

6. If you are using DANFORD serial lines as well, they become "locked"
   in a similar manner. Copying them and changing their names does not
   work. The ssiomonitor must be killed and restarted. This means that
   all consoles served by the ssiomonitor must be shut down and
   restarted in order to restart one line!

7. The group id of a forked child process is sometimes set to zero.
   This seems to occur randomly about 20 percent of the time. When the
   parent is killed, the child is not killed. Has anyone else seen
   this problem? (This seems to be unrelated to the unkillable process
   problem, but perhaps in some way it IS related?)

PLEASE HELP!
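For anyone willing to check item 7 on their own node, here is a minimal sketch of a reproduction test. It assumes that "group id" means the process group ID and that a POSIX-style fork/getpgrp interface is available; the function name `count_pgid_mismatches` and the trial count are made up for illustration, and Python is used only for brevity. It forks a number of children and counts how many did not inherit the parent's process group.

```python
import os


def count_pgid_mismatches(trials=50):
    """Fork `trials` children and count those whose process group
    differs from the parent's (hypothetical helper, for illustration)."""
    parent_pgid = os.getpgrp()
    mismatches = 0
    for _ in range(trials):
        pid = os.fork()
        if pid == 0:
            # Child: report through the exit status whether the
            # process group was inherited from the parent.
            os._exit(0 if os.getpgrp() == parent_pgid else 1)
        # Parent: collect the child's exit status.
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) != 0:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("children with unexpected process group:",
          count_pgid_mismatches())
```

On a system behaving normally the count should be zero; a nonzero count over many trials would be consistent with the roughly 20 percent failure rate described in item 7.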