QQ11@LIVERPOOL.AC.UK (08/11/90)
I have a simple csh script which ran without any problems at HP-UX 3.1 on
our 9000/850. However at 7.0 the behaviour is different. The different
behaviour has proved consistent enough for me to seek help from you.

The script fetches a file from another system every 24 hours and the
script is started by cron(1M). The process waits until the file arrives.
If for some reason the transfer has not taken place 24 hours later, I want
the earlier process killed. The following line of code is used to find the
old PID.

    set exists=`ps -uqq11 | awk '$4 == "getus.csh" && $1 != x { print $1 }' x=$$ -`

This looks for the PIDs of the program and finds the older one if it
exists (if it is found kill(1) is used).

This worked fine at HP-UX 3.1 but at 7.0 a "spurious" PID is found
(presumably as a result of executing part of the above pipe). The
resulting kill(1) fails since the process is already dead.

Here is output of ps -uqq11 in same script before the pipe:

    11752 ?        0:00 sh
    11942 ?        0:00 ps
    11764 ?        0:00 getus.csh

I would be grateful if any HP-UX 7.0 guru can explain precisely what is
happening i.e. why does the other process get included at 7.0 and not at
3.1 (for 9000/300 fans, HP-UX 3.1 on the 800 was similar to 6.5 on the
300s).

While I'm aware that I can do the whole thing differently (send *good*
suggestions to me if you want :-)), I'd like to get to the root of the
problem.

Thanks.

Alan Thew
University of Liverpool Computer Laboratory
Bitnet/Earn: QQ11@LIVERPOOL.AC.UK or QQ11%UK.AC.LIVERPOOL @ UKACRL
UUCP       : ....!mcsun!ukc!liv!qq11
Voice      : +44 51 794 3735
Internet   : QQ11@LIVERPOOL.AC.UK or QQ11%LIVERPOOL.AC.UK @ NSFNET-RELAY.AC.UK
jonb@hpcupt1.HP.COM (Jon Bayh) (08/13/90)
Alan Thew of the University of Liverpool Computer Laboratory writes:

> The following line of code is used to
> find the old PID.
>
> set exists=`ps -uqq11 | awk '$4 == "getus.csh" && $1 != x { print $1 }' x=$$ -`
>
> This looks for the PIDs of the program and finds the older one if it
> exists (if it is found kill(1) is used).
>
> This worked fine at HP-UX 3.1 but at 7.0 a "spurious" PID is found
> (presumably as a result of executing part of the above pipe). The
> resulting kill(1) fails since the process is already dead.

Mr. Thew,

I'm afraid that you are running into a race condition in the pipe code
above, as you surmised. The csh is spawning off two sub-processes, one
that will become the ps process, and one that will become the awk process.
Between the time of the fork() and the time that the forked csh performs
the exec(), however, the process inherits the name of the original
process, "getus.csh".

The race condition occurs when the ps process is exec'ed first, and begins
to run before the awk process has had a chance to perform its exec. During
this window, the script will find three processes that are named
"getus.csh": the old one that you want to kill, the parent of the above
script (which will be discarded by the $1 != x case above), and the forked
csh process that will eventually become the awk process.

This race condition also existed in 3.1, but the 7.0 'ps' is much, much
faster than the 3.1 'ps', and that probably causes the race to show up
more.

One simple (but rather kludgy) fix to the problem is to change the
"$1 != x" comparison above to something like "($1 < x || $1 > x+5)". That
takes advantage of the fact that the shell script and its ps and awk
subprocesses will probably be spawned off quickly and with sequential
PIDs. It may not work if the system is busy spawning processes, if the
system has had some long time processes around that happen to match the
PIDs that are being allocated, if the older getus.csh happens to match the
PIDs being allocated, or if a future system does not allocate PIDs
sequentially.

Another, better, solution would be to make the existence test a separate
script with a different name, perhaps "findgetus.csh". That way, its name
will not conflict with the name of the parent script. Since the parent of
the ps and awk processes would have the name "findgetus.csh", they
wouldn't match the target string and the race condition wouldn't matter.
The parent could pass its PID as a parameter to the find script so that
the find script would avoid the parent shell process.

Jon Bayh
jonb@hpda
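
Jon's second suggestion (a separately named helper that is told the caller's PID) might look roughly like the sketch below. It is written in Bourne sh rather than csh and runs against a canned listing instead of a live `ps -uqq11`, so that the effect of the renaming is visible. The names `find_old` and `sample_ps` and every PID are invented for illustration, and `awk -v` assumes a POSIX awk (the original one-liner passes `x=$$` as an operand instead).

```shell
#!/bin/sh
# Sketch of a helper in the spirit of "findgetus.csh" (hypothetical).
# find_old CALLER_PID: read a ps listing on stdin and print the PIDs of
# every getus.csh process except the caller.
find_old() {
    awk -v caller="$1" '$4 == "getus.csh" && $1 != caller { print $1 }'
}

# Canned listing, all PIDs invented. 11943 stands for the forked child
# that has not yet exec'ed awk. Because the helper runs under its own
# name, that child shows up as findgetus.csh and no longer matches.
sample_ps='  PID TTY      TIME COMMAND
11201 ?        0:00 getus.csh
11764 ?        0:00 getus.csh
11940 ?        0:00 findgetus.csh
11942 ?        0:00 ps
11943 ?        0:00 findgetus.csh'

# The caller (PID 11764 here) passes its own PID to the helper.
printf '%s\n' "$sample_ps" | find_old 11764
```

In a live script the last line would presumably become something like `ps -uqq11 | find_old "$1"`, with the parent passing its own PID as the first argument. Only the stale instance (11201 in the canned data) is reported, whichever of ps and awk wins the race.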
tgl@zog.cs.cmu.edu (Tom Lane) (08/16/90)
In article <-286539949@hpcupt1.HP.COM>, jonb@hpcupt1.HP.COM (Jon Bayh) writes:
> Alan Thew of the University of Liverpool Computer Laboratory writes:
> >
> > set exists=`ps -uqq11 | awk '$4 == "getus.csh" && $1 != x { print $1 }' x=$$ -`
> >
> > This worked fine at HP-UX 3.1 but at 7.0 a "spurious" PID is found
> > (presumably as a result of executing part of the above pipe). The
> > resulting kill(1) fails since the process is already dead.
>
> [Jon explains that the ps is spotting the subprocess forked to exec() awk,
> before the latter has been able to do the exec; it still has the shell
> script name. He suggests a couple of rather ugly solutions.]

I've run into related problems in scripts for regular sh. A cleaner fix
than either of Jon's is to direct ps's output into a temporary file:

    ps >/tmp/psout$$
    set exists=`awk 'awk program' </tmp/psout$$`
    rm -f /tmp/psout$$

This way there are no extra processes lying about at the instant ps runs,
so you need not worry about distinguishing them from the parent.

Incidentally, if the job you are trying to kill is a shell script, you
need to worry about making sure that what you find is the parent process,
not a subprocess ... the awk script could be extended to choose the right
target by looking at parent process IDs, but it wouldn't be a trivial
change. I haven't drunk enough Coke yet this morning to be able to work
it out :-)

--
tom lane
Internet: tgl@cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!cs.cmu.edu!tgl
BITNET: tgl%cs.cmu.edu@cmuccvma
CompuServe: >internet:tgl@cs.cmu.edu
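
One way the parent-PID check Tom mentions could be approached is sketched below, combined with his temporary-file idea. It is written in Bourne sh and assumes a listing with exactly three columns, PID, PPID and COMMAND; that layout is an assumption for the example, since real ps output (and the options needed to show the parent PID) varies by system. The function name `top_level`, the variable names and all PIDs are invented. A process counts as a real target only if it is named getus.csh, is not the caller, and its parent is not itself a getus.csh process.

```shell
#!/bin/sh
# top_level SELF_PID: read a "PID PPID COMMAND" listing on stdin and print
# the getus.csh processes that are neither the caller nor a child of
# another getus.csh process. (The column layout is an assumption.)
top_level() {
    awk -v self="$1" '
        # Remember every getus.csh process, in input order.
        $3 == "getus.csh" { n++; pid[n] = $1; ppid[n] = $2; named[$1] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (pid[i] != self && !(ppid[i] in named))
                    print pid[i]
        }'
}

# Canned snapshot, all PIDs invented: 11201 is the stale instance, 11230
# one of its subshells, 11764 the caller, 11943 the caller's forked child.
sample='  PID  PPID COMMAND
  310     1 cron
11201   310 getus.csh
11230 11201 getus.csh
11764   310 getus.csh
11943 11764 getus.csh
11942 11764 ps'

snapshot=/tmp/psout$$
printf '%s\n' "$sample" >"$snapshot"   # a live script would run ps here
exists=$(top_level 11764 <"$snapshot")
rm -f "$snapshot"
echo "$exists"
```

Filtering on the parent's name also discards the caller's own not-yet-exec'ed children, so this variant would tolerate the race even without the temporary file. It would not handle a getus.csh that was itself legitimately started by another getus.csh.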