[comp.bugs.4bsd] longjmp botches in sendmail on 4.3+NFS

whm@megaron.UUCP (12/12/86)

I had occasionally noticed a core file in sendmail's queue directory, but
had never thought much of it; I'd just remove it.  I then got to wondering
about how often sendmail core dumped and found that it happens more than
one might think (hope?).  In particular, submissions to a large mailing
list here (~100 non-local addresses) often produce several such core dumps.

Examination of the dumps usually reveals that sendmail got a longjmp botch.
According to my count, there are four longjmps in sendmail.  Two of them
are longjmp(TopFrame) and these are only done if QuickAbort is != 0.  The
core files show that QuickAbort is 0, so that seems to eliminate those
two longjmps as candidates for the botches in question.

The third longjmp is invoked from the smtpinit routine -- if the SMTP greeting
isn't seen within five minutes of getting a connection, an event goes off and
the longjmp is performed.

The fourth longjmp is called due to read timeout in sfgets -- the fgets-like
routine that takes steps to not get hung.  In most of the core files,
SmtpState is SMTP_OPEN, which implies that the longjmp that's blowing
up is the fourth one, in sfgets.  Popular values for SmtpPhase are
"user open", "greeting wait", and "DATA wait".

The sites involved on the far end are typically Arpanet hosts that we
exchange mail with on a regular basis.

I'm fuzzy on this, but I think the usual reason for a botch in longjmp is
that the routine that did the associated setjmp has already returned.  For
both of the likely longjmps mentioned, it seems unlikely that either
routine with the setjmp call could return before the longjmp is done.

If anyone has any suggestions on what the problem might be, I'd like to
hear them.  I think my next step is to put some debugging code in a version
of longjmp to try to narrow the problem down further.

					Bill Mitchell
					whm@arizona.edu
					{allegra,cmcl2,ihnp4,noao}!arizona!whm

jis@mit-trillian.MIT.EDU (Jeffrey I. Schiller) (12/18/86)

	The problem is caused by the two nested setjmps. Basically
what happens is that smtpinit() sets up a timer to go off after five
minutes (if it doesn't get a greeting). It then calls reply() which
ultimately calls sfgets(). sfgets sets up a timer (usually 2 hours) to
go off if no data is received (ie. you are in a collect and no data
comes in after 2 hours). The code in sfgets does a setjmp, sets a
timer (which will do a longjmp) and does the read. If the read
completes the timer is removed... HOWEVER if the 5 minute timer goes
off in smtpinit, the stack frame of sfgets is abandoned with the timer
still active.

	Now if the same sendmail process is around when that timer
goes off (ie. in two hours), which will typically only happen on LARGE
mailing lists, you get a longjmp botch.
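
	The sequence can be modeled without real timers or signals. This
is only a sketch, not sendmail's code: pending_target, outer_ctx,
inner_ctx, inner_read() and stale_timer_after_outer_timeout() are
made-up names, and the "timer" is just a pointer recording where a
pending event would longjmp. It shows that the outer jump is legal
but leaves the inner timer armed.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for the event queue: one pending timer slot. */
static jmp_buf *pending_target = NULL;  /* where a pending timer would longjmp */

static jmp_buf outer_ctx;               /* plays the role of smtpinit's jmp_buf */
static jmp_buf inner_ctx;               /* plays the role of sfgets's jmp_buf */

/* Models sfgets: setjmp, arm a timer, then "block" in the read. */
static void inner_read(void)
{
    assert(pending_target == NULL);     /* no timer should be armed on entry */

    if (setjmp(inner_ctx) != 0)
        return;                         /* the inner timeout would land here */
    pending_target = &inner_ctx;        /* arm the long (2 hour) timer */

    /* The 5 minute timer fires while we are blocked in the read. */
    longjmp(outer_ctx, 1);              /* legal: the outer frame is still live */

    /* Never reached: the normal path that would have cleared the timer. */
    pending_target = NULL;
}

/* Models smtpinit: returns 1 if a stale inner timer is left behind. */
int stale_timer_after_outer_timeout(void)
{
    pending_target = NULL;
    if (setjmp(outer_ctx) == 0)
        inner_read();

    /* We arrive here via the outer timeout.  inner_read's frame is gone,
       but its timer is still armed; firing it later would longjmp into a
       dead frame, which is the botch. */
    return pending_target == &inner_ctx;
}
```

The function returns 1, i.e. the inner event survives the unwind. The
sketch deliberately stops short of firing the stale timer, since that
jump is undefined behavior.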

	I found this bug a few weeks ago (with a mailing list of about
~250 recipients). I fixed it by changing the code in smtpinit to NOT
SET A TIMER, but to instead change the value of "ReadTimeout" (which
is the global variable that sfgets() uses to determine how long to
wait) to 5 minutes and then restore it later. Here is the comment in
my code:

	/*
	**  Get the greeting message.
	**	This should appear spontaneously.  Give it five minutes to
	**	happen.
        **
	**  JIS: We change the global variable ReadTimeout to be 5
	**      minutes. This variable is used by the lowlevel routine
	**      sfgets to determine how long to wait for input.
	**      when we get our greeting we return ReadTimeout to its
	**      previous state. IMPORTANT: The older code I replaced
	**      used a separate timeout (via a setjmp and longjmp)
	**      this LOSES REAL BIG if the 5 minute timeout goes off
	**      for then sfgets gets its stack unwound and leaves
	**      a lingering event that will eventually cause a longjmp
	**      to some ancient stack history, sendmail then dies horribly.
	**      This usually happens only when dealing with large mailing
	**      lists ("xpert" in this case > 200 recipients), which is
	**      the LAST place you want to dump core, for then the queue
	**      files are out of date and LOTS of people get a duplicate
	**      copy of the message that was in progress.
	*
	*/
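
	The shape of that change, as a rough sketch rather than the
actual patch: ReadTimeout is the real global named above, but
read_greeting(), timeout_seen_by_reader and get_greeting() are
hypothetical stand-ins, and the reader here just pretends a 220
greeting arrived.

```c
#include <assert.h>
#include <time.h>

/* Global consulted by the low-level reader; 2 hours is the usual value
   mentioned in the post. */
static time_t ReadTimeout = 2 * 60 * 60;

/* Hypothetical stand-in for reply()/sfgets(): it records the timeout it
   ran under and pretends the greeting arrived. */
static time_t timeout_seen_by_reader;

static int read_greeting(void)
{
    timeout_seen_by_reader = ReadTimeout;
    return 220;
}

/* Wait for the greeting under a 5 minute limit without arming a second
   timer: only the reader's own timeout is ever pending. */
int get_greeting(void)
{
    time_t saved = ReadTimeout;
    int code;

    ReadTimeout = 5 * 60;           /* shorten the reader's own timeout */
    code = read_greeting();
    ReadTimeout = saved;            /* restore on the way out */
    assert(ReadTimeout == saved);   /* postcondition: global is unchanged */
    return code;
}
```

Because there is only one setjmp/longjmp pair left (the one inside the
reader), a timeout always unwinds to a frame that is still live.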

	Btw. Another unrelated bug just discovered yesterday is that
if you have a LARGE number of recipients at one destination (like
wiscvm or seismo) then syslog() may get called with a line greater
than 1024 characters.... and blamo! core dump. This bug is really
in the syslog(3) routine, not sendmail itself...
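
	Until syslog itself is fixed, a caller can clamp the line
before handing it over. A sketch only: clamp_log_line() and
LOG_LINE_MAX are made-up names, the 1024 figure is the one above
(treated here as the buffer size including the terminator), and
snprintf() is a modern C library call, not something from the 4.3
library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_LINE_MAX 1024   /* assumed buffer size, including the NUL */

/* Copy msg into dst, truncating to dstsize-1 bytes if needed.
   Returns the number of bytes actually stored (excluding the NUL). */
size_t clamp_log_line(char *dst, size_t dstsize, const char *msg)
{
    int n;

    assert(dstsize > 0);
    n = snprintf(dst, dstsize, "%s", msg);
    if (n < 0) {                    /* output error: store an empty string */
        dst[0] = '\0';
        return 0;
    }
    /* snprintf reports the untruncated length; cap it at what fits. */
    return (size_t)n < dstsize ? (size_t)n : dstsize - 1;
}
```

The clamped buffer would then be passed to syslog() with a "%s" format.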

			-Jeff