[comp.sys.apollo] do you ever see this?

paul@FLEETWOOD.CC.UMICH.EDU ('da Kingfish) (12/05/87)
on your apollos (or in the log of a machine that talks SMTP to a sendmail
on an apollo), that is?

	   ----- Transcript of session follows -----
	>>> DATA
	<<< 500 Command unrecognized
	>>> QUIT
	<<< 500 Command unrecognized
	554 dennis@peanuts.nosc.mil... Remote protocol error: Bad file number

If so, you are probably running 9.6 (or maybe 9.5.*) and you have
"the fork bug."  There is a patch for it, I believe.  Do not spend
any time trying to fix sendmail.
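
If you want to check a pile of bounce messages for this symptom, something
like the following sketch would do.  It assumes only the transcript layout
shown above (">>>" for what was sent, "<<<" for the reply).  The function
names rejected_commands and looks_like_fork_bug are made up for
illustration and are not part of sendmail.  It is only a heuristic: a 500
reply to DATA or QUIT is suspicious because every SMTP server has to
accept those commands, but it does not prove you have the fork bug.

```python
from typing import List, Optional


def rejected_commands(transcript: str) -> List[str]:
    """Return the SMTP commands that were answered with a 500 reply.

    Assumes the ">>> sent" / "<<< reply" layout of a sendmail session
    transcript; any other lines are ignored.
    """
    rejected: List[str] = []
    last_sent: Optional[str] = None
    for raw in transcript.splitlines():
        line = raw.strip()
        if line.startswith(">>> "):
            # Remember the verb of the command just sent (DATA, QUIT, ...).
            last_sent = line[4:].split()[0].upper()
        elif line.startswith("<<< "):
            if line[4:].startswith("500") and last_sent is not None:
                rejected.append(last_sent)
            # A reply has been seen, so the pending command is settled.
            last_sent = None
    return rejected


def looks_like_fork_bug(transcript: str) -> bool:
    """Heuristic (an assumption, not a diagnosis): DATA or QUIT got a 500."""
    return bool({"DATA", "QUIT"} & set(rejected_commands(transcript)))


if __name__ == "__main__":
    sample = (
        "\t>>> DATA\n"
        "\t<<< 500 Command unrecognized\n"
        "\t>>> QUIT\n"
        "\t<<< 500 Command unrecognized\n"
    )
    print(rejected_commands(sample))  # ['DATA', 'QUIT']
```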

In an earlier exchange with Dave Krowitz (I think my reply to
him was bounced back), we were saying something like this:

(Dave, I hope you don't mind me including your message!)

I thought his problem was that syslog was the culprit, because sendmail
might have been trying to log something.  Syslog can fill up your disk in
one of two ways:  either by creating a giant `node_data/proc_dump,
which it can do if tcp_server dies (proc_dump grows every time a
syslog is tried, because of some error having to do with a bad fork or
pgm_$invoke), or it will syslog ok, and syslog a zillion of its own
internal error messages about a bad select (that's just a syslog
problem, not really apollo's).

But that wasn't his problem.  He said (in part)

"we don't run the syslog daemon at all ..."

(I said "don't start" in my reply to him, but that is just my own
opinion.  I run the 4.3 syslogger on my node.)

He said:
" ... typical dump looks like:"

Dump #30.

Dump Status:    0E778014: status E778014 (DOMAIN Diagnostics)
Dump Time:      1987/11/24.16:39 (EST)
Proc2 UID:      38AC94BD.60002D91
Prog UID:       349B3C61.20002D91
Prog Name:      /com/sh

Fault Diagnostic Information
Fault Status   = 00000000: 
User Fault PC  = 00000000
A6-A7:           00000000 00000000
Supervisor ECB = 00000000
Supervisor SR  = 0000
Supervisor PC  = 00000000

"... and ..."

"an 'illegal address' fault such as the one below:"

Dump #8.

Dump Status:    0E778014: status E778014 (DOMAIN Diagnostics)
Dump Time:      1987/11/23.18:27 (EST)
Proc2 UID:      38A7EDF0.B0002D91
Prog UID:       349B3C61.20002D91
Prog Name:      /com/sh

Fault Diagnostic Information
Fault Status   = 00040004: reference to illegal address (OS/MST manager)
Access Addr    = 02696368
IR             = 0000
Acc. Info      = 0000
User Fault PC  = 0E9641FE
D0-D3:           00000040 FFFFFFFC 0000000C 00000003
D4-D7:           0000003C 0E7C5012 00000001 00000000
A0-A3:           72696368 0E738024 0001F5EC 00026F88
A4-A7:           0001CAFC 0E947FF4 0E7C4F14 0E7C4EE8
Supervisor ECB = 00000000
Supervisor SR  = 0000
Supervisor PC  = 00000000
In routine "malloc" line 189
Called from "xalloc" line 142
Called from "chompheader" line 130
Called from "collect" line 115
Called from "smtp" line 269
Called from "main" line 583
Called from "UNIX_$MAIN" line 190
Called from "<apollo_c_startup>" line 31999
Called from "UID 38A2A9BF.F0002D91"
Called from "PGM_$LOAD_RUN" line 453

Now, my response to him, which Dave may or may not have already seen,
went *something* like this:

Ooooohhhhhhh.  You have something entirely different.  You may have the
"fork bug".  basically, if you call sbrk (via malloc, let's say) after
a fork, you may get pages that were not zeroed.  malloc assumes zeroed
pages from sbrk, and its (malloc's) internal data structures get
screwed up.

that's why the traceback shows the crash in malloc.
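
To make that concrete, here is a toy model in Python.  It is not Apollo's
malloc, and toy_sbrk, read_next_pointer and walk are invented names; the
only thing it borrows from the explanation above is the assumption that
fresh pages come back zeroed.  The "dirty" bytes b"rich" in the test calls
further down are chosen because A0 in the second dump holds 72696368,
which happens to be ASCII for "rich", the sort of value you might expect
if stale text were picked up as a pointer (that reading of the dump is a
guess).

```python
import struct

PAGE = 16  # toy page size in bytes; real pages are far larger


def toy_sbrk(heap: bytearray, dirty: bytes = b"") -> int:
    """Grow the toy heap by one page and return the offset of the new page.

    `dirty` models stale bytes left in a page that should have been zeroed.
    It must be no longer than PAGE.
    """
    if len(dirty) > PAGE:
        raise ValueError("dirty data must fit in one page")
    start = len(heap)
    page = bytearray(PAGE)       # all zero bytes
    page[: len(dirty)] = dirty   # overwrite the front with stale data
    heap.extend(page)
    return start


def read_next_pointer(heap: bytearray, block: int) -> int:
    """Read the 4-byte big-endian 'next block' field at the start of a block."""
    return struct.unpack_from(">I", heap, block)[0]


def walk(heap: bytearray, block: int, clear_first: bool) -> str:
    """Follow the 'next' field of a freshly obtained block.

    With clear_first=False the toy allocator trusts the page to be zero,
    so whatever bytes are there get used as a pointer.
    """
    if clear_first:
        heap[block : block + PAGE] = bytes(PAGE)  # defensive zeroing
    nxt = read_next_pointer(heap, block)
    if nxt == 0:
        return "end of list"
    if nxt >= len(heap):
        return f"illegal address {nxt:08X}"
    return f"next block at {nxt}"
```

With a clean page, walk(heap, toy_sbrk(heap), False) gives "end of list".
With toy_sbrk(heap, b"rich") it gives "illegal address 72696368" unless
clear_first is set, in which case the stale bytes are wiped before use.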

because sendmail forks for smtp commands, it gets it in spades.  it
really showed up in 9.6, but existed prior to that, i guess.  there is
a 9.6 patch, and it's fixed in 9.7.

tell apollo you have the "sendmail fork bug" and that should make sense
to them.  we run a lot of mail through some apollos, and noticed it
pretty quickly.  people in apollo's OS group, whose names I won't
mention here to keep their phone call and mail volume down, did a 
real nice (and fast) job, telling me a workaround, and providing a fix.

So, take advantage of that, and get the fix if you need it.

If you already have that patch, you found a new problem.  congrats -:)

--paul