[comp.bugs.sys5] BUG in cron

jh@pcsbst (12/11/87)

There is a bug in cron (SVR[23]) which causes cron to hang in a read()
of a named pipe. As a result, cron stops servicing crontabs until
someone starts a job via "at" or sends SIGALRM to cron.

This behavior is caused by improper use of signals.
The following code fragment is found in cron:

-----------------------

msg_wait()
{
	...

	alarm((unsigned) t);
	pmsg = &msgbuf;
	errno = 0;
	if((cnt=read(msgfd,pmsg,MSGSIZE)) != MSGSIZE) {
		if(errno != EINTR) {
	...
}

...

timeout()
{
	signal(SIGALRM, timeout);
}

-----------------------

This implementation depends on the alarm signal to interrupt
the read system call. Imagine a heavily loaded paging system. If t is
equal to 1 second, the SIGALRM may well arrive before the read
system call is issued. If that happens, timeout() runs and returns,
nothing is left to interrupt the read, and cron hangs.

This situation is hard to fix, especially if we insist on getting
the results of a successful read operation. If we could use
select(), this problem could be solved without any difficulty.
We would not have to deal with the imponderables of the UNIX
signal mechanism.
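
For systems that do have select(), a wait with a timeout might look
roughly like the sketch below. The function name wait_readable is an
assumption for illustration; it is not taken from any cron source.
Because the timeout is an argument of the system call itself, there is
no window in which it can expire unnoticed.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait until fd is readable or the timeout expires.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error
 * (errno is EINTR if a signal interrupted the wait).
 */
int wait_readable(int fd, long seconds)
{
	fd_set rfds;
	struct timeval tv;

	assert(fd >= 0 && fd < FD_SETSIZE);	/* FD_SET is undefined beyond this */
	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	tv.tv_sec = seconds;
	tv.tv_usec = 0;
	return select(fd + 1, &rfds, (fd_set *)0, (fd_set *)0, &tv);
}
```

A caller would issue the read() only after wait_readable() returns 1,
and treat a return of 0 as "time to run the next job".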

In the fix we exploit the knowledge that a meaningful message
starts with a non-zero character in the etype field. The operating
system cannot both deliver the signal and copy the result of the
read operation into the message buffer at the same time. So if
etype is still zero when the signal handler runs, no message has
been copied in yet, and the handler can safely longjmp() out of
the read.

Thus, my solution to this problem looks like this:

-----------------

char dummy = 1;
char * nojump = &dummy;
jmp_buf msg_wait_env;

msg_wait()
{
	...

	if (setjmp(msg_wait_env)) {
		errno = EINTR;
		goto msg_wait_intr;
	}
	pmsg = &msgbuf;
	nojump = &pmsg->etype;
	cnt = -1;

	/* allow longjmp from timeout() */
	*nojump = 0; /* is equal to "pmsg->etype = 0" */

	alarm((unsigned) t);
	errno = 0;
	cnt=read(msgfd,pmsg,MSGSIZE);

	/* disallow longjmp from timeout() */
	nojump = &dummy;

	if(cnt != MSGSIZE) {
msg_wait_intr:
		if(errno != EINTR) {
	...
}

...

timeout()
{
	signal(SIGALRM, timeout);	/* System V resets the handler on delivery */
	if (*nojump == 0)		/* no message copied in yet */
		longjmp (msg_wait_env, 1);
}

-----------------

This works under the assumption that a move to a pointer is an atomic
operation for the CPU. Furthermore, it assumes that a signal is never
delivered to a process while it is recovering from a page fault.

There is another alarm() situation in connection with a wait() system
call in cron which is not clean either. I did not tackle this one,
because it is fair to expect that a process started by
cron exit()s within a reasonable time.

Does anyone have a better solution? Any comments?
How many similar bugs are still hanging around?
Somebody told me that cron in BSD4.? works ok without select().

Personally, I think that the UNIX signal mechanism is the worst
thing that ever hit the programming world (I do not exclude BSD4.?).
If you have access to select(), always use select() - even in the form of

	select (0, NULL, NULL, NULL, tvp);

When select() returns it leaves no pending signals hanging over your head.
It even tells you whether it was interrupted by a signal.
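
As a small sketch of that last form, a sleep built on select() could
look like this. The name sleep_ms is an assumption for illustration;
the point is only that the return value distinguishes a full timeout
from an interruption.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Sleep for ms milliseconds without using alarm().
 * Returns 0 if the full interval elapsed, -1 otherwise
 * (errno is EINTR if a signal cut the sleep short).
 */
int sleep_ms(long ms)
{
	struct timeval tv;

	assert(ms >= 0);
	tv.tv_sec = ms / 1000;
	tv.tv_usec = (ms % 1000) * 1000;
	return select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &tv);
}
```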

		Johannes Heuft
		unido!pcsbst!jh