[news.software.b] Strange Core Dumps

bblue@crash.cts.com (Bill Blue) (12/10/87)

Perhaps someone can help me with a very strange problem I've been having
for some time now.  I'm running 2.11 at patchlevel 12 on a Symmetric
machine with 4.2 BSD.  inews is set for spooling in and out using the
.rnews directory.  I'm receiving a full feed, batched and compressed.

Several months ago I began having strange panics on the system
(noswapspace) and resulting reboots.  Nothing had been changed, it just
started happening. After watching for patterns for a time, I found that
this panic only occurred during times of uncompressing and unbatching
for posting locally.  I began watching the unbatching/posting operations
with a process monitor (a program called 'top' which displays processes,
priorities, memory usage, etc) and found that every so often one of the
inews processes would start consuming huge amounts of memory.  From a
typical of 150-180k it would start creeping up to several megabytes.
Once it exceeded my disk swap space (7mb) the system would die as
described above.  Occasionally it would dump core at the 5mb point and
continue, but not often.  I found that if I killed the process before it
got to that point everything would clean up nicely and I'd still be up.
Then, if I restarted the unbatching process on the *same* batch file, it
would process normally.  In other words, the problem does not replicate
given the same sets of articles. It somehow appears to be cumulative in
nature.

It appears as though I'm not alone in this behavior, either.  In talking
to two neighboring sites (my feed and another that I full-feed to) I
found that they have noticed very large core files laying around from
time to time, but have just deleted them since nothing serious appeared
to have happened.  This problem is obviously affecting different
machines differently, having the most profound effect on my smaller
machine.

Anyway, I've continued watching for patterns -- anything that would help
me localize what the trigger is.  And I think I've found something.
In my last five runaways (core dumps or panics), all of which happened
while inews was unbatching, after examining the .ar* and .in* files that
were being processed at the time of the runaway, I find that in EVERY
case they are BITNET articles posted to comp.lang.modula2!  Dum da dum dum....

Now, hopefully, someone will be given a clue from this behavior that
will result in a patch that will fix this obnoxious problem once and for
all. I have compared the articles being processed to others from the
same source that did post correctly, and don't notice anything peculiar
between them.  I suppose the most peculiar thing about those articles in
the first place is the Message-ID number, but I don't see anything in
the source code that suggests any problem there.

I've enclosed three of the comp.lang.modula2 articles (with shortened
text).

Please help!

--Bill


>Path: crash!ncr-sd!hp-sdd!hplabs!ucbvax!GWUVM.BITNET!MFELDMAN
>From: MFELDMAN@GWUVM.BITNET (Mike Feldman)
>Newsgroups: comp.lang.modula2
>Subject: Re: Modula-2 Texts
>Message-ID: <INFO-M2%87120800522300@UCF1VM>
>Date: 8 Dec 87 03:52:00 GMT
>Sender: daemon@ucbvax.BERKELEY.EDU
>Reply-To: Info-Modula2 Distribution List <INFO-M2%UCF1VM.bitnet@jade.berkeley.edu>
>Organization: The ARPA Internet
>Lines: 2
>
>As a general M2 book with an _excellent_ discussion of PROCESSES,
>try Ford and Wiener, "Software Engineering with Modula-2". Wiley.
>
>
>Path: crash!ncr-sd!hp-sdd!hplabs!ucbvax!UOGUELPH.BITNET!BOTCHAIR
>From: BOTCHAIR@UOGUELPH.BITNET (Alex Bewley)
>Newsgroups: comp.lang.modula2
>Subject: Re: Modula II on IBM PC with HALO graphics
>Message-ID: <INFO-M2%87120818260493@UCF1VM>
>Date: 8 Dec 87 23:10:49 GMT
>References: <glacier!jbn@labrea.stanford.edu>
>Sender: daemon@ucbvax.BERKELEY.EDU
>Reply-To: Info-Modula2 Distribution List <INFO-M2%UCF1VM.bitnet@jade.berkeley.edu>
>Organization: The ARPA Internet
>Lines: 13
>
>
>  As far as your concurrency question about async ports goes...
>
>   The communication libraries supplied with LogiTech are quite good.  The
>async routine is run off the IOTRANSFER command.  So it will interrupt your
>[stuff deleted]
>
>
>Path: crash!ncr-sd!hp-sdd!hplabs!ucbvax!NMSUVM1.BITNET!MCSMAL
>From: MCSMAL@NMSUVM1.BITNET
>Newsgroups: comp.lang.modula2
>Subject: (none)
>Message-ID: <INFO-M2%87120718485852@UCF1VM>
>Date: 8 Dec 87 00:54:26 GMT
>Sender: usenet@ucbvax.BERKELEY.EDU
>Reply-To: Info-Modula2 Distribution List <INFO-M2%UCF1VM.bitnet@jade.berkeley.edu>
>Organization: The ARPA Internet
>Lines: 14
>
>I have the Microbotics Multifunction module and the 68881 coprocessor
>for my Amiga.  I have been trying to access the 2 IEEE libraries
>(supplied by Microbotics) from Benchmark Modula-2. I have written
>modules in both Modula-2 and Assembly that attempt to call these library
>[stuff deleted]
>
>

pep@princeton.Princeton.EDU (Pat Parseghian) (12/12/87)

Your problem has a familiar ring . . .
Our mail/news gateway occasionally runs out of swap space and we haven't been
lucky enough to track it down.  It doesn't panic, it limps along - but it's
hard to get evidence when "ps" dumps core (Bus error) and utilities like "top"
and "systat" won't run (Not enough memory).

We're running 4.3BSD+NFS (Mt. Xinu) on a Microvax II.  We're running 2.11,
patch 14, but we've seen this with patch 8).  We went from patch 8 to patch 14
a week ago; whether we saw the problem prior to patch 8, I don't recall.  We
have 16 MB of memory and about the same amount of swap space.

What does your core dump suggest???

I'd like to supplement your observations:
- The offending articles are the only ones in my history file with a "%" in a
  Message-ID.
- One of the articles (<INFO-M2%87120818260493@UCF1VM>) has a References line
  that is not a valid Message-ID (to the best of my understanding).
- That particular article is still on my system as /usr/spool/news/.ar006603,
  which is linked to /usr/spool/news/comp/lang/modula2/576.

This last observation was most interesting to me, because:
- The article is in my history file.
- We sent the article successfully to our netnews neighbors (I found it in a
  spool directory on one of our neighboring machines, with a path indicating
  it came through us).
- So why didn't inews unlink /usr/spool/news/.ar006603?  Probably because it
  got hung before calling xxit().  The article arrived on our system around
  21:43 on 12/8; at 9:49 on 12/9 we ran out of memory/swap space.  At 10:42 I
  rebooted.  I'll bet that inews was running, but I can't confirm that.

Pat Parseghian, Dept. of Computer Science, Princeton University, (609) 452-6261

bblue@crash.cts.com (Bill Blue) (12/12/87)

In article <7961@princeton.Princeton.EDU> pep@Princeton.EDU (Pat Parseghian) writes:
>Your problem has a familiar ring . . .
>Our mail/news gateway occasionally runs out of swap space and we haven't been
>lucky enough to track it down.  It doesn't panic, it limps along - but it's
>hard to get evidence when "ps" dumps core (Bus error) and utilities like "top"
>and "systat" won't run (Not enough memory).
>
>We're running 4.3BSD+NFS (Mt. Xinu) on a Microvax II.  We're running 2.11,
>patch 14, but we've seen this with patch 8).  We went from patch 8 to patch 14
>a week ago; whether we saw the problem prior to patch 8, I don't recall.  We
>have 16 MB of memory and about the same amount of swap space.

I too saw this behavior at patch level #8.

>What does your core dump suggest???

That things were very scrambled.  The stack trace revealed only one
call, 'pad()+02f'.  I looked all through the code (actually grep'd) for
any sort of a routine called pad() to no avail, so just assumed that
stack display was erroneous.

>I'd like to supplement your observations:
>- The offending articles are the only ones in my history file with a "%" in a
>  Message-ID.
>- One of the articles (<INFO-M2%87120818260493@UCF1VM>) has a References line
>  that is not a valid Message-ID (to the best of my understanding).
>- That particular article is still on my system as /usr/spool/news/.ar006603,
>  which is linked to /usr/spool/news/comp/lang/modula2/576.

I missed that one completely.  The three .ar files that I have do in fact
have correctly posted counterparts (originally links) in comp.lang.modula2
here.  I completely missed the link...

In checking all the messages in my spool directory for comp.lang.modula2,
the three messages for which I have saved .ar files, are the only files in
the directory that use that format in the Message-ID line.  Others may
have it in a References line, though.

>This last observation was most interesting to me, because:
>- The article is in my history file.

Yep.

>- We sent the article successfully to our netnews neighbors (I found it in a
>  spool directory on one of our neighboring machines, with a path indicating
>  it came through us).

Same here.

>- So why didn't inews unlink /usr/spool/news/.ar006603?  Probably because it
>  got hung before calling xxit().  The article arrived on our system around
>  21:43 on 12/8; at 9:49 on 12/9 we ran out of memory/swap space.  At 10:42 I
>  rebooted.  I'll bet that inews was running, but I can't confirm that.

The times on each of my .ar files are within three minutes of each runaway or
crash that I experience.

I originally reported that rerunning the same batch did not produce the
same results.  But in thinking back about this, I'm not positive this
is so, because I'm not sure that the batch that was running at the time
of the crash was interrupted in the middle and was actually left there
for the next run.  So it looks as if something about the processing of
the Message-ID line for those particular articles is corrupting code in
another location, which doesn't show up until a later time -- like maybe
when that batch ends.

--Bill

dean@ndmath.UUCP (Dean Alvis) (12/13/87)

In article <2122@crash.cts.com>, bblue@crash.cts.com (Bill Blue) writes:
> [...]
> In my last five runaways (core dumps or panics), all of which happened
> while inews was unbatching, after examining the .ar* and .in* files that
> were being processed at the time of the runaway, I find that in EVERY
> case they are BITNET articles posted to comp.lang.modula2!  Dum da dum dum....
There have been similar problems here recently with BITNET postings to
comp.lang.modula2. We haven't had core dumps, but the articles cause their
rnews process to die (running "lastcomm rnews" shows an "X" for the
corresponding process), and aren't sent to downstream sites. Also, there are
residual ".arXXXXXX" and ".inXXXXXX" files. The environment is a Sun-2
running 4.2BSD, with 2.11news at patch level 14.

I'd appreciate e-mail if anyone figures this out.
-- 
dean@ndmath.UUCP
Dean Alvis, Math. Dept., Univ. of Notre Dame, Notre Dame IN, 46556

gnu@hoptoad.uucp (John Gilmore) (12/13/87)

pep@princeton.Princeton.EDU (Pat Parseghian) wrote:
> - The offending articles are the only ones in my history file with a "%" in a
>   Message-ID.
> - One of the articles (<INFO-M2%87120818260493@UCF1VM>) has a References line
>   that is not a valid Message-ID (to the best of my understanding).

It occurs to me that if somehow a string like this was passed to "printf"
or maybe "scanf", the big number after the % might cause havoc, like an
attempt to malloc() a large amount of memory.

I recommend that somebody with the bug (I don't run news 2.11 here) grab
one of the faulty articles and feed it back into rnews, running rnews under
adb or dbx, with "trace malloc" or a breakpoint there.
-- 
{pyramid,ptsfa,amdahl,sun,ihnp4}!hoptoad!gnu			  gnu@toad.com
		"Watch me change my world..." -- Liquid Theatre

dean@ndmath.UUCP (Dean Alvis) (12/14/87)

In article <508@ndmath.UUCP>, dean@ndmath.UUCP (Dean Alvis) writes:
> In article <2122@crash.cts.com>, bblue@crash.cts.com (Bill Blue) writes:
> > [...]
> There have been similar problems here recently with BITNET postings to
> comp.lang.modula2. We haven't had core dumps, but the articles cause their
> rnews process to die (running "lastcomm rnews" shows an "X" for the
> corresponding process), and aren't sent to downstream sites. Also, there are
                              ^^^^^^ ^^^^

after checking downstream, it appears the articles *were* sent. if i
remember correctly, there were no corresponding "sent" lines in 
/usr/lib/news/log, however. if this is correct, it suggest the problem is
in broadcast() or transmit() (in ifuncs.c).

	dean
-- 
dean@ndmath.UUCP
Dean Alvis, Math. Dept., Univ. of Notre Dame, Notre Dame IN, 46556

battan@tc.fluke.COM (Jim Battan) (12/15/87)

In article <2122@crash.cts.com>, bblue@crash.cts.com (Bill Blue) writes
about how rnews is leaving around articles and core dumps when processing
certain articles having Message-ID's of the form:
> >Message-ID: <INFO-M2%87120800522300@UCF1VM>

John Gilmore has the right idea.  This is a problem with an sprintf
overflowing due to not having enough memory and/or arguments.  The
% in the Message-ID is sent through sprintf when the article's
forwarding to other sites gets recorded in the log file.  Specifically,
in broadcast() (ifuncs.c), "sentbuf" gets the Message-ID and a list
of sites the article is sent to.  This is then sprintf-ed through
in log(), where it bombs.  How articles then get sent to the other sites
is mysterious, but I don't understand enough about the flow of the
articles to be sure.  Below, please find a suitable context diff that
can be sent thru patch(1L).  I've mailed the diff to Rick Adams for
inclusion in patch n.  Since the articles get posted correctly even
with this bug, you could wait for the official patch if you're the
cautious type.  Another possible fix could have been to the call to
log() in broadcast(), ala'
	log("%s", sentbuf);
but my fix will (hopefully) have the same effect.  Note that I haven't
yet found an incoming article to really test this on.  And a general
disclaimer:  This works for me; your mileage may vary.

*** ifuncs.c.orig	Mon Dec 14 09:10:41 1987
--- ifuncs.c	Mon Dec 14 09:03:29 1987
***************
*** 113,119 ****
  		xerror("Cannot reread article");
  	(void) fclose(fp);
  
! 	(void) strcpy(sentbuf, h.ident);
  	(void) strcat(sentbuf, " sent to ");
  	sentsys = index(sentbuf, '\0');
  	nsent = 0;
--- 117,134 ----
  		xerror("Cannot reread article");
  	(void) fclose(fp);
  
! 	/* Double "%"'s in the header's ident, since log() would try to	*/
! 	/* interpret them in sentbuf.  Articles posted by some BITNET sites */
! 	/* occasionally have these types of Message-IDs.		*/
! 	/* e.g. <INFO-M2%87120818260493@UCF1VM>				*/
! 	hptr = h.ident;
! 	sptr = sentbuf;
! 	while (*hptr != '\0') {
! 		if (*hptr == '%')
! 			*sptr++ = '%';
! 		*sptr++ = *hptr++;
! 	}
! 	*sptr = '\0';
  	(void) strcat(sentbuf, " sent to ");
  	sentsys = index(sentbuf, '\0');
  	nsent = 0;
-- 
Jim Battan     Domain: battan@tc.fluke.COM  Voice: +1 206 356 6469
UUCP: {uw-beaver,decvax!microsof,ucbvax!lbl-csam,allegra,sun}!fluke!battan

bblue@crash.cts.com (Bill Blue) (12/17/87)

In article <2479@fluke.COM> battan@tc.fluke.COM (Jim Battan) writes:
>In article <2122@crash.cts.com>, bblue@crash.cts.com (Bill Blue) writes
>about how rnews is leaving around articles and core dumps when processing
>certain articles having Message-ID's of the form:
>> >Message-ID: <INFO-M2%87120800522300@UCF1VM>
>
>John Gilmore has the right idea.  This is a problem with an sprintf
>overflowing due to not having enough memory and/or arguments.  The
>% in the Message-ID is sent through sprintf when the article's
>forwarding to other sites gets recorded in the log file.  Specifically,
>in broadcast() (ifuncs.c), "sentbuf" gets the Message-ID and a list
>of sites the article is sent to.  This is then sprintf-ed through
>in log(), where it bombs.  How articles then get sent to the other sites
>is mysterious, but I don't understand enough about the flow of the
>articles to be sure.  Below, please find a suitable context diff that
>can be sent thru patch(1L).  I've mailed the diff to Rick Adams for
>inclusion in patch n.  Since the articles get posted correctly even
>with this bug, you could wait for the official patch if you're the
>cautious type.  Another possible fix could have been to the call to
>log() in broadcast(), ala'
>	log("%s", sentbuf);
>but my fix will (hopefully) have the same effect.  Note that I haven't
>yet found an incoming article to really test this on.  And a general
>disclaimer:  This works for me; your mileage may vary.
[patch code deleted]

What I don't understand, Jim, is why you'd go to all the bother adding
that code segment to double %'s, when the above example would accomplish
the same thing?

Anyway, I made the syntax change in broadcast() (in ifuncs.c) of the
log() call as shown above, and it seems to be happy!  I had saved a
couple of the problematic articles, and piped one through my original
inews, and then again through the one with the above fix.  As expected,
the former core dumped (and then some), while the latter worked just
fine.

I encourage everyone to install this fix, and for it to be included
in the next patch.

Thanks to Jim and others who responded to this problem.

--Bill

battan@tc.fluke.COM (Jim Battan) (12/19/87)

In article <2152@crash.cts.com> bblue@crash.CTS.COM (Bill Blue) writes:
>In article <2479@fluke.COM> battan@tc.fluke.COM (Jim Battan) writes:
>What I don't understand, Jim, is why you'd go to all the bother adding
>that code segment to double %'s, when the above example would accomplish
>the same thing?

I didn't notice the other possible fix until I had already added the
code to double up the %'s.  I was following the adage "First, get it
working; then, get it right." (I know, "make it fast".)

In fact, in hindsight, I would recommend modifying the log() call
instead of adding the code I did, since changing the call would be more
straight-forward and fit in with the paradigm of log().
-- 
Jim Battan     Domain: battan@tc.fluke.COM  Voice: +1 206 356 6469
UUCP: {uw-beaver,decvax!microsof,ucbvax!lbl-csam,allegra,sun}!fluke!battan

dean@ndmath.UUCP (Dean Alvis) (12/20/87)

In article <3618@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> [...]
> It occurs to me that if somehow a string like this was passed to "printf"
> or maybe "scanf", the big number after the % might cause havoc, like an
> [...]

The problem is the call to log() near the end of broadcast() in ifuncs.c. In
broadcast(), the Message-ID string is copied into sentbuf, then the names of
systems the article is sent to are appended. When leaving broadcast(), the
line "log(sentbuf)" is s'posed to record a "sent to" line in the log file.
However, the first argument to log() is a format string (if I remember
correctly), so the "%" confuses the fprintf (?) in log(). This explains why
the articles were successfully sent, but no "sent to" line was recorded.

As a test, I sent one of the offending articles through rnews, with its
original header, and the same problem occured. After the Message-ID line
was fixed (ie.the "%" removed), it was sent through rnews again, and no
problem occured.

Perhaps one fix would be to use "log("%s",sentbuf)" - I can't claim to have
tried this yet, though.

dean
-- 
dean@ndmath.UUCP
Dean Alvis, Math. Dept., Univ. of Notre Dame, Notre Dame IN, 46556