jh@pcsbst (12/11/87)
There is a bug in cron (SVR[23]) which causes cron to hang in a read()
of a named pipe. As a result, cron stops servicing any crontabs until
someone starts a job via "at" or sends SIGALRM to cron.

This behavior is caused by an improper use of signals. The following
code fragment is found in cron:

-----------------------
msg_wait()
{
    ...
    alarm((unsigned) t);
    pmsg = &msgbuf;
    errno = 0;
    if((cnt=read(msgfd,pmsg,MSGSIZE)) != MSGSIZE) {
        if(errno != EINTR) {
            ...
        }
        ...

timeout()
{
    signal(SIGALRM, timeout);
}
-----------------------

This implementation depends on the alarm signal to interrupt the read
system call. Imagine a heavily loaded paging system. If t is equal to
1 second, the signal SIGALRM may well arrive before the read system
call is issued. If that happens, the hang occurs: the handler has
already run, no further alarm is pending, and read() blocks until a
message arrives on the pipe.

This situation is hard to fix, especially if we insist on getting the
results of a successful read operation. If we could use select(), this
problem could be solved without any difficulty. We would not have to
deal with the imponderabilities of the UNIX signal mechanism.

In the fix we exploit the knowledge that a meaningful message starts
with a non-zero character in the etype field. Since the operating
system cannot send a signal and, at the same time, copy the result of
the read operation into the message buffer, we can use this knowledge
to trigger a longjmp() from the signal handling routine.

Thus, my solution to this problem looks like:

-----------------
char dummy = 1;
char * nojump = &dummy;
jmp_buf msg_wait_env;

msg_wait()
{
    ...
    if (setjmp(msg_wait_env)) {
        errno = EINTR;
        goto msg_wait_intr;
    }
    pmsg = &msgbuf;
    nojump = &pmsg->etype;
    cnt = -1;
    /* allow longjmp from timeout() */
    *nojump = 0;    /* is equal to "pmsg->etype = 0" */
    alarm((unsigned) t);
    errno = 0;
    cnt=read(msgfd,pmsg,MSGSIZE);
    /* disallow longjmp from timeout() */
    nojump = &dummy;
    if(cnt != MSGSIZE) {
msg_wait_intr:
        if(errno != EINTR) {
            ...
        }
        ...

timeout()
{
    signal(SIGALRM, timeout);
    /* etype still zero: no message has been copied in yet */
    if (*nojump == 0)
        longjmp (msg_wait_env, 1);
}
-----------------

This works under the assumption that a move to a pointer is an atomic
operation for the CPU. Furthermore, a signal should never be sent to a
process that is recovering from a page fault.

There is another alarm() situation in connection with a wait() system
call in cron which is not clean either. I did not tackle this one,
because it is reasonable to expect that a process started by cron
exit()s within a reasonable time span.

Does someone have a better solution? Any comments? How many similar
bugs are still hanging around? Somebody told me that cron in BSD4.?
would work ok without select().

Personally, I think that the UNIX signal mechanism is the worst thing
that ever hit the programming world (I do not exclude BSD4.?). If you
have access to select(), always use select() - even in the form of

    select (0, NULL, NULL, NULL, tvp);

When select() returns, it leaves no pending signals hanging over your
head. It even tells you whether it was interrupted by a signal.

Johannes Heuft
unido!pcsbst!jh
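
As an illustration of the select() advice above, here is a minimal
sketch of a timed read that needs no alarm() at all. It assumes a
system that provides select(); the function name read_with_timeout and
its ETIMEDOUT convention are hypothetical and are not taken from cron.
Because the timeout is handed to the kernel together with the wait,
there is no window in which a timer can fire before the process
blocks.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of cron: wait up to `secs` seconds for
 * `fd` to become readable, then read at most `len` bytes into `buf`.
 * Returns the read() count, or -1 with errno set (ETIMEDOUT if nothing
 * arrived in time, EINTR if a signal interrupted the wait).
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long secs)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    /* FD_SET is undefined for descriptors outside [0, FD_SETSIZE) */
    assert(buf != NULL && fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = secs;
    tv.tv_usec = 0;

    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;          /* errno set by select(), e.g. EINTR */
    if (ready == 0) {
        errno = ETIMEDOUT;  /* timeout expired, nothing to read */
        return -1;
    }
    return read(fd, buf, len);
}
```

A caller in the style of msg_wait() would treat ETIMEDOUT the way the
original code treats EINTR, i.e. as "no message, go and run the jobs
that are due".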