paul@FLEETWOOD.CC.UMICH.EDU ('da Kingfish) (12/05/87)
on your apollos (or in the log of a machine that talks smtp to a sendmail on an apollo), that is?

   ----- Transcript of session follows -----
>>> DATA
<<< 500 Command unrecognized
>>> QUIT
<<< 500 Command unrecognized
554 dennis@peanuts.nosc.mil... Remote protocol error: Bad file number

If so, you are probably running 9.6 (or maybe 9.5.*) and you have "the fork bug." There is a patch for it, I believe. Do not spend any time trying to fix sendmail.

In an earlier exchange with Dave Krowitz (I think my reply to him was bounced back), we were saying something like this: (Dave, I hope you don't mind me including your message!)

I thought his problem was that syslog was the culprit, because sendmail might have been trying to log something. Syslog can fill up your disk in one of two ways:

  1. By creating a giant `node_data/proc_dump, which it can do if tcp_server dies. proc_dump grows every time a syslog is tried, because of some error having to do with a bad fork, or pgm_$invoke.

  2. Or it will syslog ok, and syslog a zillion of its own internal error messages about a bad select (that's just a syslog problem, not really apollo's).

But that wasn't his problem. He said (in part) "we don't run the syslog daemon at all ..." (I said "don't start" in my reply to him, but that is just my own opinion. I run the 4.3 syslogger on my node.)

He said: " ... typical dump looks like:"

   Dump #30.
   Dump Status: 0E778014: status E778014 (DOMAIN Diagnostics)
   Dump Time: 1987/11/24.16:39 (EST)
   Proc2 UID: 38AC94BD.60002D91
   Prog UID: 349B3C61.20002D91
   Prog Name: /com/sh

   Fault Diagnostic Information
   Fault Status = 00000000:
   User Fault PC = 00000000
   A6-A7: 00000000 00000000
   Supervisor ECB = 00000000
   Supervisor SR = 0000
   Supervisor PC = 00000000

"... and ..." "an 'illegal address' fault such as the one below:"

   Dump #8.
   Dump Status: 0E778014: status E778014 (DOMAIN Diagnostics)
   Dump Time: 1987/11/23.18:27 (EST)
   Proc2 UID: 38A7EDF0.B0002D91
   Prog UID: 349B3C61.20002D91
   Prog Name: /com/sh

   Fault Diagnostic Information
   Fault Status = 00040004: reference to illegal address (OS/MST manager)
   Access Addr = 02696368
   IR = 0000
   Acc. Info = 0000
   User Fault PC = 0E9641FE
   D0-D3: 00000040 FFFFFFFC 0000000C 00000003
   D4-D7: 0000003C 0E7C5012 00000001 00000000
   A0-A3: 72696368 0E738024 0001F5EC 00026F88
   A4-A7: 0001CAFC 0E947FF4 0E7C4F14 0E7C4EE8
   Supervisor ECB = 00000000
   Supervisor SR = 0000
   Supervisor PC = 00000000

   In routine "malloc" line 189
   Called from "xalloc" line 142
   Called from "chompheader" line 130
   Called from "collect" line 115
   Called from "smtp" line 269
   Called from "main" line 583
   Called from "UNIX_$MAIN" line 190
   Called from "<apollo_c_startup>" line 31999
   Called from "UID 38A2A9BF.F0002D91"
   Called from "PGM_$LOAD_RUN" line 453

Now, my response to him, which Dave may or may not have already seen, went *something* like this:

Ooooohhhhhhh. You have something entirely different. You may have the "fork bug".

Basically, if you call sbrk (via malloc, let's say) after a fork, you may get pages that were not zeroed. malloc assumes zeroed pages from sbrk, and its (malloc's) internal data structures get screwed up. That's why the traceback shows the crash in malloc. Because sendmail forks for smtp commands, it gets it in spades.

It really showed up in 9.6, but existed prior to that, I guess. There is a 9.6 patch, and it's fixed in 9.7. Tell apollo you have the "sendmail fork bug" and that should make sense to them.

We run a lot of mail through some apollos, and noticed it pretty quickly. People in apollo's OS group, whose names I won't mention here to keep their phone call and mail volume down, did a real nice (and fast) job, telling me a workaround, and providing a fix. So, take advantage of that, and get the fix if you need it.

If you already have that patch, you found a new problem.
congrats -:) --paul
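
For anyone who wants to see why unzeroed pages from sbrk can wreck an allocator, here is a rough toy model in Python. It is only a sketch and not apollo's malloc: the ToyHeap class, the 64-byte page size, the 4-byte size headers, and the 0x72 fill byte are all made up for illustration. The only thing taken from the discussion above is the assumption that a freshly obtained page reads as zero, which this toy allocator uses as its "end of heap" marker.

```python
import struct

PAGE = 64  # toy page size in bytes (hypothetical)


class ToyHeap:
    """Toy allocator: each block is a 4-byte size header plus payload.

    A zero header means "no more blocks". This is a made-up layout for
    illustration, not the layout of any real malloc.
    """

    def __init__(self, fill=0):
        self.mem = bytearray()
        self.fill = fill  # byte value that "fresh" pages arrive with
        self.top = 0      # offset where the next block header goes

    def sbrk(self, nbytes):
        # Grow the heap. With fill != 0 this models pages that were
        # NOT zeroed before being handed to the process.
        old = len(self.mem)
        self.mem.extend(bytes([self.fill]) * nbytes)
        return old

    def malloc(self, size, trust_zeroed=True):
        need = 4 + size
        # Always keep room for one terminating header after the block.
        while self.top + need + 4 > len(self.mem):
            self.sbrk(PAGE)
        struct.pack_into(">I", self.mem, self.top, size)
        payload = self.top + 4
        self.top += need
        if not trust_zeroed:
            # Defensive variant: write the terminator explicitly
            # instead of assuming the fresh page is already zero.
            struct.pack_into(">I", self.mem, self.top, 0)
        return payload

    def walk(self):
        # Follow the size headers until a zero header or end of memory.
        sizes = []
        off = 0
        while off + 4 <= len(self.mem):
            (size,) = struct.unpack_from(">I", self.mem, off)
            if size == 0:
                break
            sizes.append(size)
            off += 4 + size
        return sizes


def demo(fill, trust_zeroed=True):
    heap = ToyHeap(fill=fill)
    heap.malloc(8, trust_zeroed)
    heap.malloc(16, trust_zeroed)
    return heap.walk()
```

With zeroed pages, demo(0) walks the two blocks and stops. With dirty pages, demo(0x72) picks up a third, bogus "block" whose size is just the leftover bytes read as a number, which is the kind of corruption that later shows up as a fault inside malloc. demo(0x72, trust_zeroed=False) is clean again because the allocator no longer relies on the page contents.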