[comp.sys.hp] HP-UX 7.0 problems with ps

QQ11@LIVERPOOL.AC.UK (08/11/90)

I have a simple csh script which ran without any problems at HP-UX 3.1
on our 9000/850. However, at 7.0 the behaviour is different. The different
behaviour has proved consistent enough for me to seek help from you.

The script fetches a file from another system every 24 hours and the
script is started by cron(1M). The process waits until the file arrives.
If for some reason the transfer has not taken place 24 hours later, I
want the earlier process killed. The following line of code is used to
find the old PID.

set exists=`ps -uqq11 | awk '$4 == "getus.csh" && $1 != x { print $1 }' x=$$ -`

This looks for the PIDs of the program and finds the older one if it
exists (if one is found, kill(1) is used).

This worked fine at HP-UX 3.1 but at 7.0 a "spurious" PID is found
(presumably as a result of executing part of the above pipe). The
resulting kill(1) fails since the process is already dead.

Here is the output of ps -uqq11 in the same script, before the pipe:

 11752 ?        0:00 sh
 11942 ?        0:00 ps
 11764 ?        0:00 getus.csh

I would be grateful if any HP-UX 7.0 guru can explain precisely what is
happening i.e. why does the other process get included at 7.0 and not
at 3.1 (for 9000/300 fans, HP-UX 3.1 on the 800 was similar to
6.5 on the 300s).

While I'm aware that I can do the whole thing differently
(send *good* suggestions to me if you want :-)), I'd like to get to
the root of the problem.

Thanks.

Alan Thew
University of Liverpool Computer Laboratory
Bitnet/Earn: QQ11@LIVERPOOL.AC.UK or QQ11%UK.AC.LIVERPOOL @ UKACRL
UUCP       : ....!mcsun!ukc!liv!qq11        Voice: +44 51 794 3735
Internet   : QQ11@LIVERPOOL.AC.UK or QQ11%LIVERPOOL.AC.UK @ NSFNET-RELAY.AC.UK

jonb@hpcupt1.HP.COM (Jon Bayh) (08/13/90)

Alan Thew of the University of Liverpool Computer Laboratory writes:

>                                   The  following line of code  is used to
> find the old PID.
> 
> set exists=`ps -uqq11 | awk '$4 == "getus.csh" && $1 != x { print $1 }' x=$$ -`
> 
> This looks for the PIDs of the program and finds the older one if it
> exists (if it is found kill(1) is used).
> 
> This worked fine at HP-UX 3.1 but at 7.0 a "spurious" PID is found
> (presumably as a result of executing part of the above pipe). The
> resulting kill(1) fails since the process is already dead.

Mr. Thew,
	I'm afraid that you are running into a race condition in the
pipe code above, as you surmised.  The csh is spawning off two sub-processes,
one that will become the ps process, and one that will become the awk
process.  Between the fork() and the moment the forked csh performs its
exec(), however, the child still carries the name of the original
process, "getus.csh".  The race condition occurs when the ps process is
exec'ed first, and begins to run before the awk process has had a chance
to perform its exec.  During this window, the script will find three
processes that are named "getus.csh"---the old one that you want to kill,
the parent of the above script (which will be discarded by the $1 != x case
above), and the forked csh process that will eventually become the awk
process.  This race condition also existed in 3.1, but the 7.0 'ps' is much,
much faster than the 3.1 'ps', and that probably causes the race to show
up more.
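
The window described above can be illustrated with a canned snapshot
instead of a live ps. This is only a sketch in POSIX sh, not code from
the thread: the function name find_old_pids is made up, the PIDs other
than 11764 are invented, and "awk -v" stands in for the trailing x=$$
assignment used in the original line.

```shell
# Alan's awk test, wrapped so it can be fed a saved snapshot.
# $1 = PID of the calling script; ps-style lines arrive on stdin.
find_old_pids() {
    awk -v self="$1" '$4 == "getus.csh" && $1 != self { print $1 }'
}

# Hypothetical snapshot taken while the child that will become awk
# has been forked but has not yet exec'ed, so it still shows the
# script's name. 11941 plays the role of the running script.
snapshot=' 11764 ?        0:00 getus.csh
 11941 ?        0:00 getus.csh
 11942 ?        0:00 ps
 11943 ?        0:00 getus.csh'

# Prints the old instance and the not-yet-exec'ed child.
printf '%s\n' "$snapshot" | find_old_pids 11941
```

With this snapshot the filter reports both 11764 and 11943; the second
is the "spurious" PID, which is gone by the time kill(1) runs.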

	One simple (but rather kludgy) fix to the problem is to change the
"$1 != x" comparison above to something like "($1 < x || $1 > x+5)".  That
takes advantage of the fact that the shell script and its ps and awk
subprocesses will probably be spawned off quickly and with sequential
PIDs.  It may not work if the system is busy spawning processes, if the
system has some long-lived processes around that happen to match the
PIDs that are being allocated, if the older getus.csh happens to match
the PIDs being allocated, or if a future system does not allocate PIDs
sequentially.
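
A rough sketch of the PID-window test, again in POSIX sh on canned
data, showing one case where it works and one where it does not. The
function name and both snapshots are hypothetical; only the comparison
itself comes from the suggestion above.

```shell
# Reject anything within 5 PIDs above the calling script's own PID.
find_old_pids_window() {
    awk -v self="$1" \
        '$4 == "getus.csh" && ($1 < self || $1 > self + 5) { print $1 }'
}

# Quiet system: the children get PIDs right after the script (11941).
quiet=' 11764 ?        0:00 getus.csh
 11941 ?        0:00 getus.csh
 11942 ?        0:00 ps
 11943 ?        0:00 getus.csh'

# Busy system: other processes took the PIDs in between, so the
# not-yet-exec'ed child lands outside the window.
busy=' 11764 ?        0:00 getus.csh
 11941 ?        0:00 getus.csh
 11949 ?        0:00 ps
 11950 ?        0:00 getus.csh'

printf '%s\n' "$quiet" | find_old_pids_window 11941
printf '%s\n' "$busy"  | find_old_pids_window 11941
```

On the quiet snapshot only 11764 is reported. On the busy one 11950
slips through as well, which is the first of the failure cases listed
above.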

	Another, better, solution would be to make the existence test a
separate script with a different name, perhaps "findgetus.csh".  That way,
its name will not conflict with the name of the parent script.  Since
the parent of the ps and awk processes would have the name "findgetus.csh",
they wouldn't match the target string and the race condition wouldn't matter.
The parent could pass its PID as a parameter to the find script so that
the find script would avoid the parent shell process.

			Jon Bayh
			jonb@hpda

tgl@zog.cs.cmu.edu (Tom Lane) (08/16/90)

In article <-286539949@hpcupt1.HP.COM>, jonb@hpcupt1.HP.COM (Jon Bayh) writes:
> Alan Thew of the University of Liverpool Computer Laboratory writes:
> > 
> > set exists=`ps -uqq11 | awk '$4 == "getus.csh" && $1 != x { print $1 }' x=$$ -`
> > 
> > This worked fine at HP-UX 3.1 but at 7.0 a "spurious" PID is found
> > (presumably as a result of executing part of the above pipe). The
> > resulting kill(1) fails since the process is already dead.
> 
> [Jon explains that the ps is spotting the subprocess forked to exec() awk,
>  before the latter has been able to do the exec; it still has the shell
>  script name.  He suggests a couple of rather ugly solutions.]

I've run into related problems in scripts for regular sh.  A cleaner
fix than either of Jon's is to direct ps's output into a temporary file:

	ps >/tmp/psout$$
	set exists=`awk 'awk program' </tmp/psout$$`
	rm -f /tmp/psout$$

This way there are no extra processes lying about at the instant ps runs,
so you need not worry about distinguishing them from the parent.
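
For plain sh, the same idea might look like the sketch below. It is an
illustration under assumptions, not a tested HP-UX recipe: the function
name is made up, the listing command is passed in as arguments so the
sketch can be tried on canned data, and "awk -v" assumes a POSIX awk.

```shell
# Snapshot first, filter second: the listing command has finished
# before awk is forked, so no half-started child can show up in it.
# $1 = PID to ignore; remaining arguments = command that prints the
# process list (on a real system, something like: ps -uqq11).
find_old_pids_snapshot() {
    self=$1
    shift
    tmp="${TMPDIR:-/tmp}/psout$$"
    "$@" > "$tmp"
    awk -v self="$self" '$4 == "getus.csh" && $1 != self { print $1 }' < "$tmp"
    rm -f "$tmp"
}

# Canned stand-in for ps output; the PIDs are invented for the demo.
find_old_pids_snapshot 11941 printf '%s\n' \
    ' 11764 ?        0:00 getus.csh' \
    ' 11941 ?        0:00 getus.csh' \
    ' 11942 ?        0:00 ps'
```

In a real script the call would be along the lines of
find_old_pids_snapshot "$$" ps -uqq11, with the result captured in a
variable before deciding whether to call kill(1).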

Incidentally, if the job you are trying to kill is a shell script, you need
to worry about making sure that what you find is the parent process, not a
subprocess ... the awk script could be extended to choose the right target
by looking at parent process IDs, but it wouldn't be a trivial change.  I
haven't drunk enough Coke yet this morning to be able to work it out :-)
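
One possible shape for the parent-PID check, sketched in POSIX sh on
canned data. Everything here is an assumption rather than something
from the thread: the function name, the three-column "PID PPID COMMAND"
input (as a POSIX ps would print with -o pid= -o ppid= -o comm=, which
HP-UX 7.0 may well not offer), and the rule itself, which keeps a
process only if its parent does not carry the same name.

```shell
# Keep processes named getus.csh whose parent is NOT also named
# getus.csh. That drops subprocesses of the old instance and the
# caller's own not-yet-exec'ed children in one rule.
# $1 = PID of the calling script; "PID PPID COMMAND" lines on stdin.
find_top_level() {
    awk -v self="$1" -v name="getus.csh" '
        { ppid[$1] = $2; comm[$1] = $3; order[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) {
                p = order[i]
                if (comm[p] == name && p != self && comm[ppid[p]] != name)
                    print p
            }
        }'
}

# Invented process table: 11764 is the old instance, 11800 is one of
# its children, 11941 is the caller, 11943 is the caller's forked child.
table='  100     1 cron
11764   100 getus.csh
11800 11764 getus.csh
11941   100 getus.csh
11942 11941 ps
11943 11941 getus.csh'

printf '%s\n' "$table" | find_top_level 11941
```

On this table only 11764 is reported. The rule would miss an old
instance that was itself started from another script of the same name,
so it is a starting point rather than a complete answer.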

-- 
				tom lane
Internet: tgl@cs.cmu.edu
UUCP: <your favorite internet/arpanet gateway>!cs.cmu.edu!tgl
BITNET: tgl%cs.cmu.edu@cmuccvma
CompuServe: >internet:tgl@cs.cmu.edu