steve@eleazar.dartmouth.edu (Steve Campbell) (02/18/89)
Subject: Timing error causes sendmail to die (5.59, 5.61) +FIX Index: usr.lib/sendmail/src/usersmtp.c 4.3BSD Description: The symptom was that sendmail would sometimes die during queue runs, leaving a lockfile set on some message. There were no errors in syslog and no core file. Running the queue manually shows that the problem is a longjmp botch. Since setevent/clrevent/tick are the main users of longjmp, I ran more manual queue runs with -d5.5 set to get event data. Here are the pertinent portions of such a run that ended with the botch. I have added line numbers for reference. Running AA28991 <info-prime%ais1@ecla.usc.edu>... Connecting to eclc.usc.edu (tcp)... 1 setevent: intvl=300, for=602974135, func=bb48, arg=0, ev=689c0 tick: now=602973835 2 setevent: intvl=1800, for=602975635, func=c9f0, arg=0, ev=689e0 tick: now=602973836 3 clrevent: ev=689e0 tick: now=602973836 220-ECLC.USC.EDU SMTP Service 5.1(15)-5 at Wed 8 Feb 89 12:41:28-PST 4 setevent: intvl=1800, for=602975636, func=c9f0, arg=0, ev=689e0 tick: now=602973837 5 tick: now=602974136 >>> QUIT 6 setevent: intvl=1800, for=602975936, func=c9f0, arg=0, ev=689c0 tick: now=602974137 7 clrevent: ev=689c0 tick: now=602974149 <info-prime%ais1@ecla.usc.edu>... reply: read error <info-prime%ais1@ecla.usc.edu>... Deferred: Bad file number ...25 minutes later... Chris Bogart <bogartc@handel.cs.colostate.edu>... Connecting to handel.ColoState.EDU (tcp)... 8 tick: now=602975636 longjmp botch Here's what happens line-by-line. 1 Smtpinit in usersmtp.c calls setevent with a 5 minute timeout for greeting wait. It calls reply which calls sfgets. 2 sfgets in util.c calls setevent with a 30 minute timeout for reads. 3 The read is quick, and sfgets calls clrevent to clear the event set at line 2. 4 The line read happens to be a continuation line, so reply calls sfgets again. Sfgets again sets a 30 minute timeout. 5 Now the network goes to hell, and the next line is never received. Five minutes after the event set in line 1, the timer goes off. Sendmail gives up on the host and sends a QUIT. 6,7 An event is set for a response to the QUIT, but the connection gets closed down in 10 seconds. Sendmail goes on to the next message. The crucial point is that the event set in line 4 has never been cleared. Since the queue is big and the network slow, sendmail is still running 25 minutes later when, at line 8, that event goes off. The functions involved have been called and returned many times, so when the function that tick calls does a longjmp, zap! Repeat-By: Have sendmail do a queue run that takes more than half an hour on a lossy network. Sooner or later, you will find orphan lockfiles. Fix: The nested calls to setevent, with a subroutine (sfgets) setting a longer timeout than its caller (smtpinit), are the problem here. I can think of several fixes. One is to remove the 5-minute timer completely; let the greeting timeout in 30 minutes like all the other reads. A second fix would be to change the event structure so that when one event is cleared, it can also clear any obsolete events. Since events are only used here and in timing the queue runs, this fix seems like a lot of work for a seldom-used feature. A third fix - the one I used - is to have smtpinit temporarily change the official ReadTimeout that sfgets is going to use, rather than have smtpinit call setevent itself. While this is not the most elegant solution, it is simple and works. I have been running it on our hub for more than a week. There have been no more orphan lockfiles (I was getting about 1 a day), and nothing else seems to have broken. Here are the patches against 5.61: *** /tmp/,RCSt1006607 Sun Feb 12 13:05:44 1989 --- usersmtp.c Thu Feb 9 22:02:22 1989 *************** *** 72,87 **** ** creates connection and sends initial protocol. */ - jmp_buf CtxGreeting; - smtpinit(m, pvp) struct mailer *m; char **pvp; { register int r; ! EVENT *gte; char buf[MAXNAME]; - extern greettimeout(); /* ** Open the connection to the mailer. --- 72,84 ---- ** creates connection and sends initial protocol. */ smtpinit(m, pvp) struct mailer *m; char **pvp; { register int r; ! time_t SavedReadTimeout; char buf[MAXNAME]; /* ** Open the connection to the mailer. *************** *** 133,145 **** ** happen. */ - if (setjmp(CtxGreeting) != 0) - goto tempfail; - gte = setevent((time_t) 300, greettimeout, 0); SmtpPhase = "greeting wait"; setproctitle("%s %s: %s", CurEnv->e_id, CurHostName, SmtpPhase); r = reply(m); ! clrevent(gte); if (r < 0 || REPLYTYPE(r) != 2) goto tempfail; --- 130,142 ---- ** happen. */ SmtpPhase = "greeting wait"; setproctitle("%s %s: %s", CurEnv->e_id, CurHostName, SmtpPhase); + SavedReadTimeout = ReadTimeout; + if (ReadTimeout > 300) + ReadTimeout = 300; r = reply(m); ! ReadTimeout = SavedReadTimeout; if (r < 0 || REPLYTYPE(r) != 2) goto tempfail; *************** *** 218,231 **** unavailable: smtpquit(m); return (EX_UNAVAILABLE); - } - - - static - greettimeout() - { - /* timeout reading the greeting message */ - longjmp(CtxGreeting, 1); } /* ** SMTPRCPT -- designate recipient. --- 215,220 ----