[news.software.nn] nnmaster dies

lipton@odin.m2c.org (Gary Lipton) (02/12/90)

I had nn running fine for a couple of weeks (under ULTRIX)
but now I can't seem to keep nnmaster running.  It goes for a while,
but it doesn't finish updating the groups.  Nothing that I know
of has changed.  I've tried reinstalling, but it doesn't help.
Any suggestions?


Gary Lipton - Massachusetts Microelectronics Center    ##   ## 2 #####
lipton@m2c.org (CSNET) ****************************    ### ###  ##    
lipton%m2c.org@relay.cs.net (Internet) ************    ## # ##  ##
{harvard,bu-cs,frog,ulowell}!m2c!lipton (UUCP) ****    ##   ##   #####

raymond@ptolemy.arc.nasa.gov (Eric A. Raymond) (02/13/90)

lipton@odin.m2c.org (Gary Lipton) writes:

>I had nn running fine for a couple of weeks (under ULTRIX)
>but now I can't seem to keep nnmaster running.  It goes for a while,
>but it doesn't finish updating the groups.  Nothing that I know
>of has changed.  I've tried reinstalling, but it doesn't help.
>Any suggestions?


Me too (on a Sun).

-- 
Eric A. Raymond  (raymond@ptolemy.arc.nasa.gov)
G7 C7 G7 G#7 G7 G+13 C7 GM7 Am7 Bm7 Bd7 Am7 C7 Do13 G7 C7 G7 D+13: Elmore James

davison@drivax.UUCP (Wayne Davison) (02/13/90)

lipton@odin.m2c.org (Gary Lipton) wrote:
} I had nn running fine for a couple of weeks (under ULTRIX)
} but now I can't seem to keep nnmaster running.  It goes for a while,
} but it doesn't finish updating the groups.  Nothing that I know
} of has changed.  I've tried reinstalling, but it doesn't help.
} Any suggestions?

It sounds like you've either got an old version of nn (pre-v6.3.6), or
there is another database bug to be found.  In either case, you can track
down the problem using the following tactics:

Do an "ls -t" of the nn/db/DATA directory to discover which data file was
last updated by nnmaster.  The group causing the trouble will be the first
group with new news after this one.  For example, if the list shows:

	/usr/spool/nn/db/DATA>ls -lt
	total 1426
	-rw-rw-r--  1 news          136 Feb 12 17:56 277.x
	-rw-rw-r--  1 news         2183 Feb 12 17:56 277.d

then the last group to be successfully updated was group 277 (0-relative).
Check the GROUPS file in nn/db on line 278 (1-relative, it's a text file)
to discover which one it is, and which groups immediately follow it.  You
can then cd to the offending group, and examine the latest news that came
in for corruption.  The most likely cause is a header that is too long or
mangled in some way (the body of the article shouldn't matter).
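
The 0-relative to 1-relative lookup can also be done mechanically.  What
follows is only a sketch in C, not part of nn: group_line() is a
hypothetical helper, and it assumes nothing about the GROUPS file beyond
what is said above (a text file with one group per line).  It copies the
line for a given 0-relative group number into a caller-supplied buffer.

```c
#include <stdio.h>
#include <string.h>

/* Copy the text line for 0-relative group number `group` (that is,
 * 1-relative line group + 1) from the file `path` into `out`.
 * Lines longer than the internal buffer are truncated, not split.
 * Returns 0 on success, -1 if the file cannot be opened or is too short. */
int group_line(const char *path, long group, char *out, size_t outsz)
{
    FILE *fp;
    char buf[1024];
    long lineno = 0;          /* 0-relative index of the current line */
    int at_line_start = 1;    /* is buf the first chunk of a line? */
    int found = -1;

    if (outsz == 0 || group < 0)
        return -1;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);
        int complete = (len > 0 && buf[len - 1] == '\n');

        if (lineno == group && at_line_start) {
            if (complete)
                buf[len - 1] = '\0';      /* drop the newline */
            strncpy(out, buf, outsz - 1);
            out[outsz - 1] = '\0';
            found = 0;
            break;
        }
        at_line_start = complete;
        if (complete)
            lineno++;
    }
    fclose(fp);
    return found;
}
```

With the listing above, calling group_line() with group 277 and then 278,
279, ... on nn/db/GROUPS would give the last group updated and the
candidates for the offending one.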

If you have a pre-6.3.6 version of nn, you still have a bug in the name
compression routine that will cause nn to crash when the name header is
longer than 256 (or so) bytes.  You can get an update to fix it so that
this doesn't happen in the future.
-- 
Wayne Davison          \  /| / /| \/ /| /(_)         davison@drivax.UUCP
                      (_)/ |/ /\| / / |/  \          ...!amdahl!drivax!davison

hankin@sauron.osf.org (Scott Hankin) (02/13/90)

raymond@ptolemy.arc.nasa.gov (Eric A. Raymond) writes:

>lipton@odin.m2c.org (Gary Lipton) writes:

>>I had nn running fine for a couple of weeks (under ULTRIX)
>>but now I can't seem to keep nnmaster running.  It goes for a while,
>>but it doesn't finish updating the groups.  Nothing that I know
>>of has changed.  I've tried reinstalling, but it doesn't help.
>>Any suggestions?


>Me too (on a Sun).

    I've had this happen a number of times with older versions of nn.  The
    problem is that when nn crashes for any reason, you never find out about
    it.  The solution I found was to compile nn with the -g flag, and use dbx
    to determine where the crash occurred.  In my case, I found a bug which
    had already been fixed in the next version, which had been available for
    some time.  So my guess is that you either 1) don't have the most recent
    version or 2) have found a new bug, which I'm sure Kim would welcome
    hearing about (and I'm sure he'd welcome your fix as well!).  Go to it.

- Scott

------------------------------
Scott Hankin  (hankin@osf.org)
Open Software Foundation

bvk@hhb.UUCP (Brett Kuehner) (02/14/90)

lipton@odin.m2c.org (Gary Lipton) writes:

>I had nn running fine for a couple of weeks (under ULTRIX)
>but now I can't seem to keep nnmaster running.  It goes for a while,
>but it doesn't finish updating the groups.  Nothing that I know
>of has changed.  I've tried reinstalling, but it doesn't help.
>Any suggestions?

This happens to me as well (under Ultrix). I'll start nnmaster (-r -C),
it'll run happily for a while (~10 minutes), and then it'll vanish
without a complaint. I'll restart it, and it'll do the same thing.
Several days later, it'll function fine, without any changes on my
part (perhaps expire is getting rid of an article that kills
nnmaster?). This has happened several times, and even a complete
rebuild doesn't seem to fix it (although it did once).

		Brett
--
Brett Kuehner, Racal-Redac, Mahwah, NJ
...!princeton!hhb!bvk
bvk%hhb@princeton.EDU

news@teda.UUCP (Teraida Newsadm) (02/15/90)

mrapple@quack.UUCP (Nick Sayer) writes:

>raymond@ptolemy.arc.nasa.gov (Eric A. Raymond) writes:

>>lipton@odin.m2c.org (Gary Lipton) writes:

>>>I had nn running fine for a couple of weeks (under ULTRIX)
>>>but now I can't seem to keep nnmaster running...

>>Me too (on a Sun).

I've seen this problem as well.  My solution is to run nnmaster from cron
every 15 minutes.  Fifteen minutes doesn't seem like much of a delay, and
when little new news has arrived it runs very quickly.  Perhaps a new
feature of nnmaster would be for it to fork a child to perform the actual
work.  This way if the child exits for any reason, the original nnmaster
is still running to run the update later.  This kind of strategy is
used by sendmail.
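
A rough sketch in C of the fork-a-child idea, to make it concrete.  This
is not nnmaster code: run_pass_in_child(), supervise(), and the two demo
pass functions are hypothetical names made up for the illustration, and
the "pass" callback merely stands in for whatever work an update does.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one update pass in a child process and report how it ended.
 * Returns the child's exit status, 128 + signal number if the child
 * was killed by a signal, or -1 if fork() or waitpid() failed. */
int run_pass_in_child(int (*pass)(void))
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(pass() & 0xff);     /* child: do the work, then exit */

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Supervisor loop: the parent outlives any crash in the child and
 * simply tries again after `interval` seconds, `rounds` times. */
void supervise(int (*pass)(void), unsigned interval, int rounds)
{
    while (rounds-- > 0) {
        (void) run_pass_in_child(pass);
        sleep(interval);
    }
}

/* Demo passes (hypothetical): one succeeds, one dies like a crash. */
int ok_pass(void)       { return 0; }
int crashing_pass(void) { abort(); }
```

A real daemon would loop indefinitely and log the return value rather
than discard it, so that a crashing pass does not go unnoticed.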

-- 
Teraida News Admin (aka Mikel Lechner)		UUCP:  news@teraida.UUCP
Teradyne EDA, Inc.
5155 Old Ironsides Drive
Santa Clara, Ca 95054

davison@drivax.UUCP (Wayne Davison) (02/16/90)

news@teda.UUCP (Teraida Newsadm) wrote:
[experiencing abnormal exits from nnmaster]
} My solution is to run nnmaster from cron every 15 minutes. 

Are you sure you're still getting the entire database updated?  The only
time I had a problem with nnmaster exiting was when an article with an
extremely long "From:" header was received.  When this occurs, NO groups
past the group with the problem article are EVER updated until the article
expires.  Running nnmaster from cron only updates part of the database.
You can check this by looking in your nn/db/DATA directory to see if there
is a group of high-numbered files that do not have recent dates.

(BTW, did anyone read my article on how to track down the offending article
so that you could fix the article??  Did anyone find anything?)

If there is some other reason that nn is exiting (besides the From-header
bug that was fixed in usenet v6.3.6 and ftp v6.3.8), a crashing nnmaster
means that there is something else to be fixed.  Help Kim track it down
if you can.  Upgrade your nn if possible so you don't get stung by bugs
that have already been fixed.
-- 
Wayne Davison           \  /| / /| \/ /| /(_)    davison%drivax@uts.amdahl.com
davison@drivax.UUCP    (_)/ |/ /\| / / |/  \         ...!amdahl!drivax!davison

iand@labtam.oz (Ian Donaldson) (02/17/90)

lipton@odin.m2c.org (Gary Lipton) writes:

>I had nn running fine for a couple of weeks (under ULTRIX)
>but now I can't seem to keep nnmaster running.  It goes for a while,
>but it doesn't finish updating the groups.  Nothing that I know
>of has changed.  I've tried reinstalling, but it doesn't help.
>Any suggestions?

We found that if nnmaster was being run, connecting to the remote
via NNTP, and the remote host dies, so does nnmaster.  It doesn't try
and re-establish the NNTP link.

A workaround was simply to run nnmaster from a script in a loop

	while true
	do
	    nnmaster <args>
	    sleep 600
	done

Then, if the remote host dies, nnmaster exits and doesn't get restarted
for 10 minutes... saves thrashing the local system with startup database
checks all the time.

Ian D

storm@texas.dk (Kim F. Storm) (02/19/90)

Several users have complained about nnmaster dying.  There were
definitely bugs in the pre 6.3.6/6.3.8 versions which could cause the
master to die, either because of articles with bad headers, or due to
network problems.

If you run 6.3.6, the "bad header" problem should be fixed (but it
still exists in 6.3.7! - get patch 8).  And with 6.3.10, most of the
network problems should be solved as well, so when

iand@labtam.oz (Ian Donaldson) writes:

>We found that if nnmaster was being run, connecting to the remote
>via NNTP, and the remote host dies, so does nnmaster.  It doesn't try
>and re-establish the NNTP link.

I believe this has changed with version 6.3.8.  We made some efforts
to have nnmaster detect immediately that the nntp server dies, and
just go to sleep immediately and wait for one -r period (you probably
don't use less than 10 minutes intervals to stay on good terms with
your NNTP server :-)

If this happened with 6.3.7 or later, I would certainly like to hear
about it.

And a small gripe:  When somebody reports a problem with nn, please
always mention the version, e.g. 6.3.7 meaning release 6.3, patchlevel 7.
It makes it much easier for me to distinguish between new and old bugs.
Thanks!

-- 
Kim F. Storm        storm@texas.dk        Tel +45 429 174 00
Texas Instruments, Marielundvej 46E, DK-2730 Herlev, Denmark
	  No news is good news, but nn is better!

news@teda.UUCP (Teraida Newsadm) (02/21/90)

davison@drivax.UUCP (Wayne Davison) writes:

>news@teda.UUCP (Teraida Newsadm) wrote:
>[experiencing abnormal exits from nnmaster]
>} My solution is to run nnmaster from cron every 15 minutes. 

>Are you sure you're still getting the entire database updated?  The only

An update to my posting of last week:

I converted my nnmaster back to running in daemon mode instead of running
it from cron.  It seems to work fine now.  I am running the 6.3.10 release
of nn.  Upgrading nn to the current patch level seems to fix the problem
of nnmaster dying.  It may resolve the problem for other sites as well.

Good luck.

-- 
Teraida News Admin (aka Mikel Lechner)		UUCP:  news@teraida.UUCP
Teradyne EDA, Inc.
5155 Old Ironsides Drive
Santa Clara, Ca 95054

seindal@skinfaxe.diku.dk (Rene' Seindal) (02/21/90)

storm@texas.dk (Kim F. Storm) writes:

> iand@labtam.oz (Ian Donaldson) writes:

> >We found that if nnmaster was being run, connecting to the remote
> >via NNTP, and the remote host dies, so does nnmaster.  It doesn't try
> >and re-establish the NNTP link.

> I believe this has changed with version 6.3.8.  We made some efforts
> to have nnmaster detect immediately that the nntp server dies, and
> just go to sleep immediately and wait for one -r period (you probably
> don't use less than 10 minutes intervals to stay on good terms with
> your NNTP server :-)

The nnmaster uses the same NNTP code as nn, and it will only try to reconnect
to the NNTP server if it gets an NNTP response code that indicates a server
timeout.  This hardly ever happens for the master, since the NNTP servers I
have seen only time out if the socket has been idle for a while (e.g., 15
minutes or so).

I have never seen nnmaster die because of a server crash, but I have seen it
hang.  The problem is that the socket used hasn't got keepalive set.  If the
NNTP server crashes while nnmaster is connected, the master just sits there in
a read() call that never returns.  It has to be killed forcibly.  If keepalive
had been set, it would get a SIGPIPE that would interrupt the read() call.

There was a note about this in the file NNTP, saying I had seen this
behaviour, but didn't know why.  I do now, so it will be fixed in 6.4.
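
For reference, keepalive is turned on per socket with setsockopt().  The
sketch below is not taken from nntp.c; enable_keepalive() and
keepalive_enabled() are hypothetical helper names.  How long the system
waits before probing an idle connection is system-dependent (often
hours by default), and exactly how a dead peer is reported to a blocked
read() varies too: on many systems the call returns an error such as
ETIMEDOUT rather than delivering a signal, so the caller should check
for both.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Ask the system to send keepalive probes on an existing socket.
 * Returns 0 on success, -1 on failure (errno set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}

/* Report whether keepalive is set on a socket.
 * Returns 1 if set, 0 if not, -1 if getsockopt() fails. */
int keepalive_enabled(int fd)
{
    int on = 0;
    socklen_t len = sizeof on;

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

The call would go right after the socket to the NNTP server is created,
before the first read().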

Rene' Seindal (seindal@diku.dk)

duncan@helium.siesoft.co.uk (Duncan McEwan) (02/26/90)

davison@drivax.UUCP (Wayne Davison) writes:

>If you have a pre-6.3.6 version of nn, you still have a bug in the name
>compression routine that will cause nn to crash when the name header is
>longer than 256 (or so) bytes.

I "discovered" a bug that sounds just like this one in my 6.3.1 sources
last week, but on grabbing 6.3.7 over the weekend and having a quick
look at it, the same bug still appears to be there -- so perhaps it is
a different one to that mentioned by Wayne.

The one I found is in pack_name.c ...

> pack_name(dest, source, length)
> char *dest, *source;
> int length;
> {
>    register char *p, *q, *r, c;
>    register int n;
>    char namebuf[129], *name;
>    ...
>    p = source, q = namebuf, n = 0;
>    
>    ...
>
>	*q++ = c;

As far as I can see, there is nothing to protect *q++ from overwriting
the stack if the name in the From: line is longer than 129 chars.  Last
week we received just such an article, which caused nnmaster to core dump
every time it tried to process it.  The hack fix I applied was to guard
the assignment statement with a check on the length of (q - namebuf).
I am uncertain as to whether this will have subtle adverse effects on the
remainder of the name parsing code (since it is now possible for there to
be mismatched parentheses, etc).
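
In isolation, the kind of guard described here looks something like the
following.  It is only a sketch: copy_name_bounded() and NAMEBUF_SIZE are
hypothetical names, and the real pack_name() does a good deal more than
copy characters, so this shows the bounds check and nothing else.

```c
#include <stddef.h>

#define NAMEBUF_SIZE 129   /* same size as namebuf in the fragment above */

/* Copy src into buf, storing at most size - 1 characters plus a
 * terminating NUL.  Excess input is silently dropped.
 * Returns the number of characters stored (not counting the NUL). */
size_t copy_name_bounded(char *buf, size_t size, const char *src)
{
    char *q = buf;
    char c;

    if (size == 0)
        return 0;
    while ((c = *src++) != '\0') {
        if ((size_t) (q - buf) >= size - 1)
            break;              /* guard: stop before overrunning buf */
        *q++ = c;
    }
    *q = '\0';
    return (size_t) (q - buf);
}
```

Truncating this way keeps the stack intact, at the cost noted above: the
stored name may end in the middle of a quoted or parenthesized part.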

The above source fragment (and in fact the whole pack_name.c file) is identical
in 6.3.1 and 6.3.7.  Perhaps the above bug has been fixed elsewhere in
the routines that read the header into memory, but from a quick look through
open_news_article() and parse_header() I couldn't see that it had.  Can
anyone tell me if I am missing something, or has this particular bug not
been fixed until after 6.3.7?

Duncan.

storm@texas.dk (Kim F. Storm) (02/27/90)

>davison@drivax.UUCP (Wayne Davison) writes:

>>If you have a pre-6.3.6 version of nn, you still have a bug in the name
>>compression routine that will cause nn to crash when the name header is
>>longer than 256 (or so) bytes.

duncan@helium.siesoft.co.uk (Duncan McEwan) writes:

>I "discovered" a bug that sounds just like this one in my 6.3.1 sources
>last week, but on grabbing 6.3.7 over the weekend and having a quick
>look at it, the same bug still appears to be there.

>Can anyone tell me if I am missing something, or has this particular
>bug not been fixed until after 6.3.7?

The "pack_name" bug exists in the following versions:

	6.3.0 - 6.3.5
	6.3.7

It is fixed in:

	6.3.6
	6.3.8 - 6.3.10
	6.4

The problem was that 6.3.7 was released before I heard of this bug,
but I still managed to get it into the posted patch #6.  I thus had to
produce a patch #8 to fix the bug in 6.3.7 as well.

Patch 8 was posted here last week!

-- 
Kim F. Storm        storm@texas.dk        Tel +45 429 174 00
Texas Instruments, Marielundvej 46E, DK-2730 Herlev, Denmark
	  No news is good news, but nn is better!

davison@drivax.UUCP (Wayne Davison) (02/28/90)

I wrote:
} If you have a pre-6.3.6 version of nn, you still have a bug in the name
} compression routine [...]

Duncan McEwan (duncan@helium.siesoft.co.uk) adds:
} I "discovered" a bug that sounds just like this one in my 6.3.1 sources
} last week, but on grabbing 6.3.7 over the weekend and having a quick
} look at it, the same bug still appears to be there

Ahh, the joys of nn patchlevels bit me on my oversimplification there.  As
mentioned before, the bug was fixed in usenet version 6.3.6, but not until
ftp version 6.3.8.  Patch #8 is only about 2K and fixes the name routine in
a similar manner to what you've already done.  It's so small, that I think
I'll follow Kim's recent example and simply include the patch on the end of
this message for all those people who missed it the last time.
-- 
Wayne Davison           \  /| / /| \/ /| /(_)    davison%drivax@uts.amdahl.com
davison@drivax.UUCP    (_)/ |/ /\| / / |/  \         ...!amdahl!drivax!davison
----8<------8<------8<------8<-----cut here----->8------>8------>8------>8----
From: storm@texas.dk 

This is patch #8 to nn release 6.3.

This patch redoes the fixes for the "long From: line bug" that was also
fixed in patch 6, but unfortunately "reverted" by patch 7 to align it
with the release 6.3.7 made available via anon-ftp before patch 6 was
posted.  See also patchlevel.h.

++Kim Storm

--------------------- CUT HERE ----------------------
*** /usr/storm/nn6.3.7/patchlevel.h	Fri Sep  8 12:46:52 1989
--- patchlevel.h	Fri Sep 15 19:05:47 1989
***************
*** 15,21 ****
   *	1989-08-22:  Patch 5: db.c
   *	1989-08-25:  Patch 6: admin.c pack_date.c
   *	1989-09-08:  Patch 7: several files
   */
  
! #define PATCHLEVEL 7
  
--- 15,34 ----
   *	1989-08-22:  Patch 5: db.c
   *	1989-08-25:  Patch 6: admin.c pack_date.c
   *	1989-09-08:  Patch 7: several files
+  *
+  *	NOTICE: Release 6.3.7 was distributed for anon-ftp before patch 6
+  *		was officially released on usenet.  Unfortunately, the
+  *		patch posted as #6 is not the patch #6 indicated above,
+  *		because the "Long From: line bug" fixed by the posted  
+  *		patch #6 was still present in the 6.3.7 available via ftp.
+  *
+  *	Therefore, future patches relating to 6.3.7 will use the normal
+  *	patch numbering scheme, while future patches to the originally
+  *	posted nn will be numbered 61, 62, etc. (if any - which I don't 
+  *	hope).
+  *
+  *	1989-09-15:  Patch 8: pack_name.c nntp.c
   */
  
! #define PATCHLEVEL 8
  

*** /usr/storm/nn6.3.7/pack_name.c	Fri Sep  8 12:46:52 1989
--- pack_name.c	Mon Sep 11 12:37:16 1989
***************
*** 183,188 ****
--- 183,189 ----
  	return 0;
  
      p = source, q = namebuf, n = 0;
+     maxq = namebuf + sizeof namebuf - 1;
      
  new_partition:
      for (i = SEP_MAXIMUM; --i >= 0; separator[i] = NULL);
***************
*** 211,216 ****
--- 212,218 ----
  	    continue;
  	}
  	if (n > 1) continue;
+ 	if (q >= maxq) break;
  	*q++ = c;
  	if (IS_SEPARATOR(c)) {
  	    switch (sep = (Class[c] & 0xff)) {

*** /usr/storm/nn6.3.7/nntp.c	Fri Sep  8 12:46:51 1989
--- nntp.c	Fri Sep 15 13:54:28 1989
***************
*** 44,49 ****
--- 44,51 ----
  
  import int errno, sys_nerr;
  import char *sys_errlist[];
+ extern int user_error();
+ extern int sys_error();
  
  #define syserr() (errno >= 0 && errno < sys_nerr ? \
  		  sys_errlist[errno] : "Unknown error.")
----8<------8<------8<------8<-----the end------>8------>8------>8------>8----