whm@megaron.UUCP (12/12/86)
I had occasionally noticed a core file in sendmail's queue directory, but had never thought much of it; I'd just remove it. I then got to wondering how often sendmail dumps core and found that it happens more than one might think (hope?). In particular, submissions to a large mailing list here (~100 non-local addresses) often produce several such core dumps.

Examination of the dumps usually reveals that sendmail got a longjmp botch. According to my count, there are four longjmps in sendmail. Two of them are longjmp(TopFrame), and these are only done if QuickAbort is nonzero. The core files show that QuickAbort is 0, so that seems to eliminate those two longjmps as candidates for the botches in question. The third longjmp is invoked from the smtpinit routine -- if the SMTP greeting isn't seen within five minutes of getting a connection, an event goes off and the longjmp is performed. The fourth longjmp is called due to a read timeout in sfgets -- the fgets-like routine that takes steps to avoid hanging.

In most of the core files, SmtpState is SMTP_OPEN, which implies that the longjmp that's blowing up is the fourth one, in sfgets. Popular values for SmtpPhase are "user open", "greeting wait", and "DATA wait". The sites involved on the far end are typically Arpanet hosts that we exchange mail with on a regular basis.

I'm fuzzy on this, but I think the usual reason for a botch in longjmp is that the routine that did the associated setjmp has already returned. For both of the likely longjmps mentioned, it seems unlikely that the routine with the setjmp call could return before the longjmp is done.

If anyone has any suggestions on what the problem might be, I'd like to hear them. I think my next step is to put some debugging code in a version of longjmp in order to narrow down the problem some more.

Bill Mitchell
whm@arizona.edu
{allegra,cmcl2,ihnp4,noao}!arizona!whm
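For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a read guarded by a setjmp and a timer, in the spirit of what the post says sfgets does. It is not sendmail's code: it assumes a POSIX system and uses alarm() and SIGALRM where sendmail uses its own event mechanism, and the names timed_fgets, on_alarm, and read_timeout_ctx are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes sigsetjmp under strict -std= modes */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Jump target for the timeout. It is only valid while timed_fgets()
 * is still on the stack. */
static sigjmp_buf read_timeout_ctx;

static void on_alarm(int sig)
{
    (void)sig;
    siglongjmp(read_timeout_ctx, 1);
}

/* Hypothetical stand-in for sfgets: like fgets, but gives up after
 * timeout_secs seconds. Returns NULL on timeout, EOF, or error. */
char *timed_fgets(char *buf, int size, FILE *fp, unsigned timeout_secs)
{
    void (*old_handler)(int);
    char *p;

    assert(buf != NULL && size > 0);
    old_handler = signal(SIGALRM, on_alarm);

    if (sigsetjmp(read_timeout_ctx, 1) != 0) {
        /* The timer fired while this frame was still live. */
        signal(SIGALRM, old_handler);
        return NULL;
    }

    alarm(timeout_secs);
    p = fgets(buf, size, fp);
    alarm(0);                     /* cancel the timer before returning */
    signal(SIGALRM, old_handler);
    return p;
}
```

The point to notice is the alarm(0) before the return. If a timer were ever left armed after timed_fgets() had returned, or after its frame had been abandoned by some other longjmp, the handler would jump into a stack frame that no longer exists, which matches the botch hypothesis above. Jumping out of a stdio call from a signal handler is also not strictly safe on modern systems; the sketch only mirrors the design being discussed.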
jis@mit-trillian.MIT.EDU (Jeffrey I. Schiller) (12/18/86)
The problem is caused by the two nested setjmps. Basically, what happens is that smtpinit() sets up a timer to go off after five minutes (if it doesn't get a greeting). It then calls reply(), which ultimately calls sfgets(). sfgets() sets up a timer (usually 2 hours) to go off if no data is received (i.e. you are in a collect and no data comes in after 2 hours). The code in sfgets() does a setjmp, sets a timer (which will do a longjmp), and does the read. If the read completes, the timer is removed...

HOWEVER, if the 5 minute timer goes off in smtpinit(), the stack frame of sfgets() is abandoned with its timer still active. Now if the same sendmail process is still around when that timer goes off (i.e. two hours later), which will typically only happen on LARGE mailing lists, you get a longjmp botch.

I found this bug a few weeks ago (with a mailing list of about 250 recipients). I fixed it by changing the code in smtpinit() to NOT SET A TIMER, but instead to change the value of "ReadTimeout" (the global variable that sfgets() uses to determine how long to wait) to 5 minutes and then restore it later. Here is the comment in my code:

/*
**  Get the greeting message.
**	This should appear spontaneously.  Give it five minutes to
**	happen.
**
**	JIS: We change the global variable ReadTimeout to be 5
**	minutes. This variable is used by the lowlevel routine
**	sfgets to determine how long to wait for input.
**	When we get our greeting we return ReadTimeout to its
**	previous state. IMPORTANT: The older code I replaced
**	used a separate timeout (via a setjmp and longjmp)
**	this LOSES REAL BIG if the 5 minute timeout goes off
**	for then sfgets gets its stack unwound and leaves
**	a lingering event that will eventually cause a longjmp
**	to some ancient stack history, sendmail then dies horribly.
**	This usually happens only when dealing with large mailing
**	lists ("xpert" in this case > 200 recipients), which is
**	the LAST place you want to dump core, for then the queue
**	files are out of date and LOTS of people get a duplicate
**	copy of the message that was in progress.
*/

BTW, another unrelated bug, discovered just yesterday, is that if you have a LARGE number of recipients at one destination (like wiscvm or seismo), then syslog() may get called with a line longer than 1024 characters... and blamo! core dump. This bug is really in the syslog(3) routine, not in sendmail itself...

-Jeff
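The save-and-restore of ReadTimeout described in the post above can be sketched roughly as follows. This is an illustration only, not the actual patch: the type of ReadTimeout, the GREETING_TIMEOUT constant, the reply_fn callback type, get_greeting(), and fake_reply() are all assumptions made so that the example stands on its own.

```c
#include <assert.h>
#include <stddef.h>

#define GREETING_TIMEOUT (5L * 60L)   /* five minutes, in seconds */

/* Stand-in for sendmail's global, which the post says sfgets()
 * consults to decide how long to wait. The type is an assumption. */
long ReadTimeout = 2L * 60L * 60L;    /* "usually 2 hours" */

/* Hypothetical callback type standing in for reply(). */
typedef int (*reply_fn)(void *arg);

/* Wait for the greeting with a shortened ReadTimeout instead of a
 * second, independent timer. Only one timer (the one inside the
 * low-level read) ever exists, so nothing is left armed behind an
 * abandoned stack frame. */
int get_greeting(reply_fn reply, void *arg)
{
    long saved = ReadTimeout;
    int code;

    assert(reply != NULL);
    ReadTimeout = GREETING_TIMEOUT;
    code = reply(arg);
    ReadTimeout = saved;              /* restore for later reads */
    return code;
}

/* Hypothetical fake reply(): records the timeout it observed and
 * returns 220, the SMTP greeting code. */
int fake_reply(void *arg)
{
    *(long *)arg = ReadTimeout;
    return 220;
}
```

Calling get_greeting(fake_reply, &seen) lets you check that the callback ran with the five-minute value and that ReadTimeout is back to its previous value afterwards.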