[comp.sys.sun] Cron bug on the SS-1 under 4.0.3c summary, long

ajudge@maths.tcd.ie (02/09/90)

Here is a summary of the replies I have received about a cron bug which
causes some cron jobs to be run twice.

The bug is acknowledged by Sun and a patch is available, but even after
the patch the problem still recurs.

>>> cron:1

X-From:      bengts@Sweden.Sun.com

It's a bug in the cron program, bugid #1022379. You can get a new cron
program from your local answercentre.  This new cron also works on the
4/390 but not on the 4/330.  On the 4/330 you patch a value in the kernel.

>>> cron:3

X-From:      jay@silence.princeton.nj.us

Yes, this is a known problem.  It affects all Suns (bug in the SysV
version of cron in SunOS) but it bites the 4/60 and the 386i more than
others because of some kernel workaround for a hardware problem (the
details I've forgotten).  On the 4/60, I believe that the problem is
partly that the clock is frequently reset.  Sun has supplied fixes for
some architectures but not, to my knowledge, for the 4/60.  If you run ntp
the problem will become even more severe.

The only current workaround (necessary even with Sun's patch if you run
ntp) is to wrap each cron job with a shell script or program which creates
a lockfile to prevent duplicate invocations.

Here's an example locker.  Fancier than anything you really need, but you
can weed out the cruft:
*** cut ***
#! /bin/csh -f

# Prevent cron from executing jobs twice

unset MAILTO
set JOBSHELL = "/bin/sh -c"

goto start
usage:
echo Usage: `basename $0` '[options] lockname command ...\
Options:\
	-m mailto	mail output to "mailto"\
	-s shell	execute command with "shell"\
	-c		execute command with "csh -c"\
	-C		execute command with "csh -cf"'
exit

start:
set CMD = "$0 $*"
set parsing = 1
while ( $parsing )
	if ( $#argv < 2 ) then
		goto usage
	endif
	switch ( x$1 )
		case x-m:
			set MAILTO = "$2"
			shift; shift
			breaksw
		case x-s:
			set JOBSHELL = "$2"
			shift; shift
			breaksw
		case x-c:
			set JOBSHELL = "/bin/csh -c"
			shift
			breaksw
		case x-C:
			set JOBSHELL = "/bin/csh -cf"
			shift
			breaksw
		case x-*:
			goto usage
			breaksw
		default:
			set parsing = 0
			breaksw
	endsw
end

set LOCK = /tmp/$1.cronlock.$LOGNAME
echo $$ > $LOCK
sleep 60

set OUT = /tmp/$1.$$.$LOGNAME
touch $OUT
chmod 600 $OUT
shift

if ( -e $LOCK ) then
	if ( x$$ == x`cat $LOCK` ) then
		$JOBSHELL "$*" >& $OUT
		rm -f $LOCK
		goto wrapup
	endif
endif
echo "Passing the buck." > $OUT

wrapup:
if ( ! -z $OUT ) then
	if ( $?MAILTO ) then
		/usr/ucb/Mail -s "Cron job (`hostname`): $CMD" "$MAILTO" < $OUT
	else
		echo "Cron job (`hostname`)"
		cat $OUT
	endif
endif
rm -f $OUT

>>> cron:6

X-From:      alex <alexl%daemon.cna.tek.com@RELAY.CS.net>

> From: Ed Anselmo <anselmo-ed@yale.edu>
> Subject: Re: cron running twice
>
> Sun is offering a patched version of cron.  Part of the README file follows:
>
> Bugs Fixed:
> ------------
> 1.  cron.c:
>     1019719:  print at(1) job number in syslog messages
>     1023418:  cron queue handling and scheduling is broken
>     1012011:  Initialize USER as well as LOGNAME environment variable
>     1017698:  cron sends erroneous error message when job can't be executed
>     1014181:  add pid and queue name to the CMD syslog message
>     1012398:  "cron"/"at"/"batch" runs more jobs than queue limit
>     1022379:  cron executes crontab entries twice  (duplicate of 1027075)
>
> 2.  funcs.c:
>     1011113:  invalid sys_errlist message number is >= sys_nerr, not > sys_nerr
>
> (We received this through the standard support channels, i.e. hotline@sun.com)
> --
> Ed Anselmo   anselmo-ed@cs.yale.edu   {harvard,decvax}!yale!anselmo-ed
>
>
> From: Dan Lorenzini <uunet.uu.net!gcm!amadeus!dal@tektronix.TEK.COM>
> To: uunet!eecs.nwu.edu!sun-managers@uunet.uu.net
> Subject: Re: Double Cron
> In-Reply-To: Your message of Tue, 07 Nov 89 14:45:07 -0800.
> Date: Wed, 08 Nov 89 11:42:51 -0500
> Status: OR
>
>
> Re: cron doing things twice:
>
> The way I heard it, it is a hardware problem (the Sparcstation is too
> fast) :-)
>
> Apparently, there was a problem with 4.0 cron executing jobs twice
> (there was (still is?) a problem with calendar also).  Sun patched it,
> but it still has the problem on the Sparcstation-1's that we have
> here.
>
> Sun sent me a workaround -- I haven't used it yet, but here it is in
> case anybody needs it:
>
> ------------------------------------------------------------------------
> 	#!/bin/sh
> 	LOCK=/tmp/.mumble-lock
> 	echo $$ > ${LOCK}
> 	sleep 60
> 	if [ $$ = `cat ${LOCK}` ]; then
> 		# I get to do it
> 		rm ${LOCK}
> 	else
> 		# The other process gets to do it
> 		exit
> 	fi
> 	# Actually do whatever you wanted to do...
> ------------------------------------------------------------------------
>
> Dan Lorenzini
> uunet!gcm!dal
>

I put in the fixed cron and still get the problem on occasion.

>>> cron:9

X-From:      dmc%cam.sri.com@Warbucks.AI.SRI.com

You probably know this by now, but this is a Sun bug (Sparcstations are
too fast for the software or something). There is a workaround; replace
"mycommand blah blah" by "safe_cron mycommand blah blah" in your crontab,
where safe_cron is the following script:

#!/bin/sh
# Workaround for the bug where cron jobs sometimes get run twice, a
# minute apart, on Sparcstations.
if [ "`arch`" = sun4 ]; then
  LOCK=/tmp/.`basename $1`.lock
  # Last writer wins: record our pid, then give any duplicate a minute
  # to overwrite it.
  echo $$ > ${LOCK}
  sleep 60
  if [ "$$" = "`cat ${LOCK}`" ]; then
    # I get to do it
    rm ${LOCK}
  else
    # The other process gets to do it
    exit
  fi
fi
# Actually do it (on a Sun4 only if we won the lock; elsewhere always).
# "$@" keeps each argument intact where $* would re-split it.
"$@"

=======================================================================

Alan Judge, SysAdmin, Dept. of Maths, Trinity College, Dublin, Ireland.
    ajudge@maths.tcd.ie  a.k.a.  amjudge@cs.tcd.ie
also, Distributed System Group, Dept. of Computer Science, TCD.

dworkin@solbourne.com (Dieter Muller) (02/11/90)

In article <4887@brazos.Rice.edu> ajudge@maths.tcd.ie writes:
>X-Sun-Spots-Digest: Volume 9, Issue 36, message 12
>
>Here is a summary of the replies I have received about a cron bug which
>causes some cron jobs to be run twice.
>
>The bug is acknowledged by Sun and a patch is available, but even after
>the patch the problem still recurs.

The `bug' appears to be a kernel problem, technically.  What happens is
that cron does a sleep for N seconds, but wakes up after N-1 seconds.  It
starts the next job (the one we're a second early for), and then performs
a reschedule.  Since the time for the `next' job (the one we just started)
hasn't arrived yet, cron puts it back at the front of the list, sleeps 1
second, and poof!  You just ran the job twice....
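
The sequence above can be played through on a fake clock.  What follows is
a toy model in sh, not cron's code: count_runs, the 60-second slot and the
clock variable are all made up for illustration, and it assumes a POSIX sh
for the $(( )) arithmetic.

```shell
#!/bin/sh
# Toy model only; count_runs and the fake clock are hypothetical.
# $1 = seconds by which the timer fires early (0 = on time).
count_runs() {
    early=$1
    due=60                      # fake-clock time the job is scheduled for
    clock=$(( due - early ))    # moment cron actually wakes up
    next=$due                   # slot at the front of cron's list
    runs=0
    while [ "$next" -eq "$due" ]; do
        runs=$(( runs + 1 ))    # job is started without consulting the clock
        if [ "$clock" -lt "$due" ]; then
            # Reschedule sees the slot as still in the future: requeue it,
            # then sleep the remaining second(s).
            next=$due
            clock=$due
        else
            next=$(( due + 60 ))  # slot really has passed; move on
        fi
    done
    echo "$runs"
}
```

In this model "count_runs 1" reports two runs for the one slot, while
"count_runs 0" reports one.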

This is a side-effect of the mechanism for user crontabs.  Specifically,
while cron is `sleeping', it's really waiting in a select for messages on
a named pipe.  If a message came in (user X's crontab changed, etc), it
handles that and goes back into the select.  If the select timed out, cron
assumes the timer expired, and that no other external event occurred to
fake out the timer.

A simple way to demonstrate the problem is to send a SIGALRM to the cron
process.  And, as mentioned above, the official Sun fixes don't.

The correct fix is for cron to check the time after a select time out.  If
the desired time hasn't yet occurred, reset the timer and go back to the
select (basically, act like a null message came in on the pipe).  I put
this fix into Solbourne's version of cron, and we haven't heard of the
problem recurring since then.
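
The idea of re-checking the clock after every wake-up might look roughly
like this in sh.  This is only a sketch of the technique, not the Solbourne
change, which lives in cron's own source and is not shown here.  wait_until
and its optional second argument are hypothetical, and the sketch assumes a
POSIX sh and a date(1) that understands +%s, which the SunOS 4.0.3 date may
not.

```shell
#!/bin/sh
# Sketch only: wait_until is a hypothetical helper, not part of any cron.
# Assumes a POSIX sh and a date(1) that supports +%s (epoch seconds).

# wait_until TARGET [NAPPER]
#   Return only once the clock has reached TARGET (epoch seconds).
#   NAPPER is the command used to sleep; it defaults to sleep and exists
#   so that an early wake-up can be simulated.
wait_until() {
    target=$1
    napper=${2:-sleep}
    while :; do
        remaining=$(( target - $(date +%s) ))
        if [ "$remaining" -le 0 ]; then
            return 0            # the desired time has really arrived
        fi
        # Not there yet (first pass, or we woke early): re-arm and wait again.
        "$napper" "$remaining"
    done
}
```

If NAPPER is a command that always returns after one second, however long
it was asked to sleep, wait_until still does not return before TARGET; it
just goes round the loop again.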

	Dworkin

boulder!stan!dworkin  dworkin%stan@boulder.colorado.edu  dworkin@solbourne.com