pcl%robots.oxford.ac.uk@nsfnet-relay.ac.uk (Paul Leyland) (03/02/90)
Can anyone explain the problem described below, and suggest a fix for it, rather than the palliative I knocked together this morning?

We have a 4/390 running 4.0.3 and including all sendmail patches that were available as at the end of February. We used to see this on our old 3/280 though, so I don't think it has anything to do with the CPU architecture.

Every so often, a process runs wild and takes all available CPU time until killed, thus preventing highly-niced background jobs from doing anything useful. Last night, one ran for 352 minutes before being stopped by hand. The process name, as shown by "ps ax", is of the form "-AA#####" where ##### is the PID. Sometimes, but not always, there is a /usr/lib/sendmail running full-tilt as well.

In /var/spool/mqueue, there are corresponding spool files. They're normally empty, but occasionally they have a fragment of incoming mail headers.

I've not noticed any correlation with other system activity, nor with the state of health of other machines on the local ethernet.

I've taken to running the following script from cron every 15 minutes which, while being mildly gruesome, does at least let us get some work done.

Paul Leyland

8<----------------------------------------------------------------
#! /usr/bin/csh -fb
# Flush free-running mail queue items.  2-Mar-1990 by P.C. Leyland

if ($#argv != 1) then
    echo Usage $0 time
    exit 1
endif

if ($1 <= 20) then
    echo Time must be greater than 20 seconds, you gave $1
    exit 1
endif

while (1)
    set COMMAND = `ps ax | grep 'AA[0-9][0-9][0-9][0-9][0-9]' | head -1`
    if ($#COMMAND == 0) exit 0
    # There is one. Now, has it been running for too long?
    set noglob
    set RUNTIME = `echo $COMMAND[3] | sed -e 's?:? * 60 + ?'`
    unset noglob
    @ SECONDS = $RUNTIME
    if ($SECONDS >= $1) then
        kill -9 $COMMAND[1]
        rm -f /var/spool/mqueue/*{$COMMAND[1]}
    endif
end
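
The time test in the script relies on turning the "minutes:seconds" TIME value reported by ps into a count of seconds; the sed substitution rewrites the colon as " * 60 + " and csh's @ evaluates the result. The same conversion can be tried in isolation. What follows is only a minimal sketch in Bourne shell rather than csh, assuming the value always has the form minutes:seconds; to_seconds is a hypothetical helper name and is not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): convert a
# "minutes:seconds" CPU time, as assumed to be printed by ps, to seconds.
to_seconds() {
    # awk converts each field to a decimal number, so "07" becomes 7.
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}
```

For example, "to_seconds 352:00" prints 21120, which corresponds to the 352 minutes mentioned above.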