[news.software.b] Cnews expire problem... need help

karish@mindcraft.com (Chuck Karish) (12/08/90)

In article <1990Dec7.130639.15803@bnr.ca> janick@bnr.ca
(Janick Bergeron) writes:
>For the last few days, I keep receiving the following message from
>Cnews/expire:
>
>expire problems:
>expire: wrong number of fields in `	660190899~-...'
>
>(Note: that's a <TAB> between the '`' and the first '6')
>
>How do I locate the offending article and force its expiry ??

There are two problems to deal with:

- fixing the history file so you stop getting mail from expire
- expiring the article

#1:    sed '/^	660190899~-/d' < history > history.fixed
       (The white space after the '^' is a literal tab; not every
       sed understands \t.)

#2:    Use 'grep' instead of 'sed', and look at the file name
       on the end of the line.  If the file name isn't there, it's
       already been expired.

       Alternate solution: Just leave the file there.  Put an
       entry into crontab to remove articles older than the
       maximum age in your explist file.
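
A rough sketch of the two steps above in shell. It assumes a history line is made of tab-separated fields (message-ID, dates, and a file name field that is absent once the article has expired), so the "2 or 3 fields" rule is an assumption drawn from this thread rather than from the C News sources. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the function names and the 2-or-3-field rule are assumptions.

# usage: bad_history_lines history-file
# Print lines whose first field is empty or whose tab-separated
# field count is not 2 or 3, prefixed with their line number.
bad_history_lines() {
	awk -F'\t' '$1 == "" || NF < 2 || NF > 3 { print NR ": " $0 }' "$1"
}

# usage: files_for_date history-file date-prefix
# Print the third (file name) field of lines whose date field starts
# with the given string; no output means nothing is left to remove.
files_for_date() {
	awk -F'\t' -v d="$2" 'index($2, d) == 1 && NF >= 3 { print $3 }' "$1"
}
```

For the line in the original report, `files_for_date history 660190899` would print the file name field if the article is still on disk, and nothing if it has already been expired.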

New question: What causes the history lines to be mangled in
the first place?  I'm getting about eight or ten of them a
month.  Is it related to having my spool disk fill up?
-- 

	Chuck Karish		karish@mindcraft.com
	Mindcraft, Inc.		(415) 323-9000		

mrm@sceard.Sceard.COM (M.R.Murphy) (12/09/90)

In article <660596702.10086@mindcraft.com> karish@mindcraft.com (Chuck Karish) writes:
>In article <1990Dec7.130639.15803@bnr.ca> janick@bnr.ca
>(Janick Bergeron) writes:
>>For the last few days, I keep receiving the following message from
>>Cnews/expire:
>>
>>expire problems:
>>expire: wrong number of fields in `	660190899~-...'
>>
[...]
>
>New question: What causes the history lines to be mangled in
>the first place?  I'm getting about eight or ten of them a
>month.  Is it related to having my spool disk fill up?
>-- 
>
>	Chuck Karish		karish@mindcraft.com
>	Mindcraft, Inc.		(415) 323-9000		

In our case, it was a bad sector on the disk :-) :-(

The following code cleans up a history file so that mkdbm is happy with it,
and also replaces the single awk line that sifts a history file and prints
only lines that are after a given time that I used in a modified expire scheme.
The checking for goodness in a history line could be made fancier, but this is
enough to make mkdbm happy. Makes for a pretty fast expire, too. Every so often,
writing a short specialized tool in C is appropriate, though I'd rather use
awk :-) In the case of a bad disk block, awk groused about a record too long and
bailed. If 8192 is bad for a buffer here, somebody could get fancy with malloc,
or maybe just write the whole thing in one line of perl.

---- cut here ----
/*
 * exphist - scan history and write only good lines for mkdbm
 * usage is exphist time_from_getdate
 *
 * cursory examination of this code will show that it is snagged from mkdbm.
 *
 */
#include <stdio.h>
#include <stdlib.h>	/* atol(), exit() */
#include <string.h>

int
main(int argc, char *argv[])
{
	long		exptime;
	static char	buff[8192];
	register char	*scan;
	register char	*line;

	if (argc < 2 || (exptime = atol(argv[1])) == 0) {
		fprintf(stderr, "Usage: exphist time\n");
		exit(2);
	}

	for (;;) {
		line = fgets(buff, sizeof(buff), stdin);
		if (line == NULL)
			break;
		scan = strchr(line, '\t');
		if (scan == NULL || line[strlen(line)-1] != '\n') {
			fprintf(stderr, "bad format: `%.60s'\n", line);
			continue;
		}
		if (atol(++scan) > exptime)
			fputs(line, stdout);
	}
	exit(0);
}
---- cut here ----

This is the fragment from expire:

...
now=`getdate now`
ago=`awk "/^\/expired\// {print ($now-(86400*\$(3)))} {next}" explist`
# replace the single-line awk with exphist
#awk "{split(\$2,dates,\"~\");if(dates[1]>$ago)print \$0}" history >history.n
exphist $ago < history >history.n
mkdbm history.n &&
[ -s history.n ] &&
mv history history.o &&		# install new ASCII history file
mv history.n history &&
rm -f history.pag &&		# and related dbm files
rm -f history.dir &&
mv history.n.pag history.pag &&
mv history.n.dir history.dir
...

Since expire will be executed from cron, the error messages from exphist
and mkdbm will show up in mail to somebody important should they occur.
Note the "&&" after the mkdbm step. This is added to keep history relatively
intact should there be a major failure.
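
To make the role of the "&&" chain concrete, here is a small sketch of the same install step wrapped in a shell function. `install_history` and its `build` argument are hypothetical; `build` stands in for mkdbm and is assumed to exit non-zero on failure and to leave history.n.pag and history.n.dir behind on success.

```shell
#!/bin/sh
# Sketch only: install_history and the "build" argument are hypothetical.
# usage: install_history dir build-command
# Every step is chained with &&, so the first failure stops the chain
# and the remaining steps are skipped.
install_history() {
	dir=$1 build=$2
	( cd "$dir" &&
	  "$build" history.n &&		# stand-in for mkdbm
	  [ -s history.n ] &&		# refuse to install an empty file
	  mv history history.o &&	# keep the old ASCII history file
	  mv history.n history &&
	  rm -f history.pag history.dir &&
	  mv history.n.pag history.pag &&
	  mv history.n.dir history.dir )
}
```

If the build step fails, nothing after it runs, so the live history, history.pag and history.dir are untouched. A failure later in the chain can still leave things half-installed, which is why the old file is kept as history.o.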

None of this stuff takes too kindly to bad blocks, totally running out of
file system space, or inodes, or such, but then, what of UNIX(tm) does?
-- 
Mike Murphy  mrm@Sceard.COM  ucsd!sceard!mrm  +1 619 598 5874

karish@mindcraft.com (Chuck Karish) (12/10/90)

In article <1990Dec8.190114.15171@sceard.Sceard.COM> mrm@Sceard.COM
(M.R.Murphy) writes:
>
>The following code cleans up a history file so that mkdbm is happy with it,
>and also replaces the single awk line that sifts a history file and prints
>only lines that are after a given time that I used in a modified expire scheme.
>The checking for goodness in a history line could be made fancier, but this is
>enough to make mkdbm happy. Makes for a pretty fast expire, too. Every so often,
>writing a short specialized tool in C is appropriate, though I'd rather use
>awk :-)

This C program is needed only to avoid re-writing the whole history
file during checking.  On my machine, the mkdbm step takes much longer
than the scan anyway and I have enough disk space for a second copy
of history, so I use this one-liner in sed:

sed -n '/^<.*	/p'	# The white space in the pattern is a tab.

>...
>now=`getdate now`
>ago=`awk "/^\/expired\// {print ($now-(86400*\$(3)))} {next}" explist`
># replace the single-line awk with exphist
>#awk "{split(\$2,dates,\"~\");if(dates[1]>$ago)print \$0}" history >history.n

Doesn't this reproduce the functionality specified by the 'expired'
line in the expire control file?
-- 

	Chuck Karish		karish@mindcraft.com
	Mindcraft, Inc.		(415) 323-9000		

mrm@sceard.Sceard.COM (M.R.Murphy) (12/10/90)

In article <660770337.20986@mindcraft.com> karish@mindcraft.com (Chuck Karish) writes:
>In article <1990Dec8.190114.15171@sceard.Sceard.COM> mrm@Sceard.COM
>(M.R.Murphy) writes:
>>
>>The following code cleans up a history file so that mkdbm is happy with it,
>>and also replaces the single awk line that sifts a history file and prints
>>only lines that are after a given time that I used in a modified expire scheme.
>>The checking for goodness in a history line could be made fancier, but this is
>>enough to make mkdbm happy. Makes for a pretty fast expire, too. Every so often,
>>writing a short specialized tool in C is appropriate, though I'd rather use
>>awk :-)
>
>This C program is needed only to avoid re-writing the whole history
>file during checking.  On my machine, the mkdbm step takes much longer
>than the scan anyway and I have enough disk space for a second copy
>of history, so I use this one-liner in sed:
>
>sed -n '/^<.*	/p'	# The white space in the pattern is a tab.
>
>>...
>>now=`getdate now`
>>ago=`awk "/^\/expired\// {print ($now-(86400*\$(3)))} {next}" explist`
>># replace the single-line awk with exphist
>>#awk "{split(\$2,dates,\"~\");if(dates[1]>$ago)print \$0}" history >history.n
>
>Doesn't this reproduce the functionality specified by the 'expired'
>line in the expire control file?
>-- 
>
>	Chuck Karish		karish@mindcraft.com
>	Mindcraft, Inc.		(415) 323-9000		

The C program referenced in article <1990Dec8.190114.15171@sceard.Sceard.COM>
above does not just avoid re-writing the whole history file during checking.
It does reproduce the functionality specified by the 'expired' line in the
expire control file, sort of, but the C News expire is not used at all in the
simple scheme for "expiration" that I posted a while back.

Expiration is maintaining the news database, that is, the articles that are the
ebb and flow of USENET as we know it, and the control of reception of duplicate
articles from other sites. The scheme is based on:

  1) don't accept an article from another site that has already been received,
     that is, that already exists in the history file, and
  2) don't keep old articles lying about wasting space.

Another function of the standard C News expire, that is, archiving, I think is
better separated. It is more reasonable to set up a sys file entry that sends
articles from newsgroups to be archived to an archiver when they are received
from the feed. The archiver can then be quite clever and selective about what
it bothers to archive. The less that the expiration process has to handle, the
better.

To accomplish this scheme, I split C News expiration into two separate parts,
expire, which maintains the history file and handles 1) above, and trasher
which gets rid of old articles and handles 2) above. BTW, the Expires: header
is ignored by trasher on the basis that it is only the business of a system's
administrators how long an article should take up space. I have since kissed
off the script that was trasher and replaced it with reap by
dt@yenta.alb.nm.us (thanks, david).

Expiration of the history file is just the creation of a new history file that
omits lines of the previous history file that are older than some particular
time. It need have nothing to do with whether the articles referenced by that
line are still around. I used the one-liner awk script

awk "{split(\$2,dates,\"~\");if(dates[1]>$ago)print \$0}" history >history.n

to do just that. I was happy enough with this part of the scheme until a bad
disk block corrupted the history file. Oops. Awk groused because a record was
too long for it to handle. Mkdbm groused because the line in the history file
was not up to its expectations for a valid line (simple and incomplete though
those expectations were). BTW, the corrupted part of the history file had a
less-than followed by some characters and a tab, so it would have passed the
sed test referenced by Chuck and still would have given mkdbm a problem. Unless
sed croaked on the line, too. :-)

To get around the problem of lines that mkdbm chokes on, I decided to snag the
code from mkdbm and twiddle it about a little so that it would just read history
lines on its standard input and write only lines on its standard output that
mkdbm would be happy with. As long as I was going to do that much, I might as
well have it do the check for old lines, too. That way, exphist, the new C
program, reads an old history file, deletes bad lines or lines that are too old,
and writes the output so that mkdbm can make the new history files. The awk line
above is then replaced by

exphist $ago <history >history.n

Then mkdbm, move the results around, and save the old stuff. Simple, no?
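
The filtering described above, with a slightly "fancier" goodness check, can be sketched in awk for illustration. This is not exphist and not what mkdbm checks: `filter_history` and its rules (2 or 3 tab-separated fields, a message-ID of the form <...>, a numeric arrival time) are assumptions. As noted above, an awk with a small record limit may still bail on a badly corrupted line, which is the reason the real tool is in C.

```shell
#!/bin/sh
# Sketch only: filter_history and its checks are illustrative assumptions,
# not the behaviour of exphist or mkdbm.
# usage: filter_history cutoff < history > history.n
# Keeps a line only if it has 2 or 3 tab-separated fields, a message-ID
# of the form <...>, a numeric arrival time, and that time is after cutoff.
filter_history() {
	awk -F'\t' -v cutoff="$1" '
	NF < 2 || NF > 3	{ next }	# wrong number of fields
	$1 !~ /^<[^ ]+>$/	{ next }	# message-ID missing or mangled
	{
		split($2, dates, "~")		# arrival~expiry
		if (dates[1] ~ /^[0-9]+$/ && dates[1] + 0 > cutoff + 0)
			print
	}'
}
```

Unlike the sed test, this rejects a line that merely starts with a less-than and contains a tab, because the first field must also end in a greater-than and the date must be a number.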

On our news machine, both the scan and the mkdbm are fast :-) Exphist and mkdbm
could have been combined, and would probably have been faster, but these tools
are more useful when separate. That's part of the UNIX(tm) philosophy.

Reap is a separate process for getting rid of old articles. It is completely
independent of the process of maintaining history. Reap also has the benefit
that it is:

  1) short enough so that I can understand it,
  2) flexible,
  3) fast,
  4) and, written by someone else so I didn't have to do it.
     (thanks again, david)
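
Reap itself is not shown in this thread. As a crude stand-in for the idea only, removing old articles independently of history can be sketched with find. `remove_old_articles` is a hypothetical name, and the sketch assumes articles are stored under the spool directory as files whose names are article numbers.

```shell
#!/bin/sh
# Sketch only: remove_old_articles is a hypothetical stand-in, not reap.
# usage: remove_old_articles spool-dir days
# Deletes regular files whose names are all digits (article numbers)
# and whose modification time is more than "days" days in the past.
remove_old_articles() {
	find "$1" -type f -name '[0-9]*' ! -name '*[!0-9]*' \
		-mtime +"$2" -exec rm -f {} \;
}
```

Like trasher, this looks only at the age of the file and pays no attention to any Expires: header or to the history file.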

Again, the standard C News expire is not used at all. What I am talking about
here is an alternate method of maintaining the news database: articles and
history. Yes, it is a Really Good Thing to lock so that no other News
processing goes on during the history expiration. It is not necessary to
lock News processing during reaping. Everything needs to be locked against
itself running at the same time. Don't you just love crons that can't keep
things straight?
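
One common way to lock a cron job against itself is a lock directory, since mkdir either creates it or fails in a single step. The sketch below is an assumption for illustration, not how C News implements its own locks; `with_lock` and the exit status 75 are made up here.

```shell
#!/bin/sh
# Sketch only: with_lock and the lock-directory convention are assumptions;
# this is not how C News implements its own locks.
# usage: with_lock lock-dir command [args...]
# mkdir either creates the directory or fails, so only one caller at a
# time can hold the lock. Returns 75 if the lock is already taken,
# otherwise the exit status of the command.
with_lock() {
	lock=$1
	shift
	mkdir "$lock" 2>/dev/null || return 75
	"$@"
	status=$?
	rmdir "$lock"
	return $status
}
```

A run that is killed part-way leaves the lock directory behind, and it then has to be removed by hand before the next run will start.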

I really like C News. Thanks to its authors.
-- 
Mike Murphy  mrm@Sceard.COM  ucsd!sceard!mrm  +1 619 598 5874

karish@mindcraft.com (Chuck Karish) (12/11/90)

Henry Spencer and David Lawrence have each pointed out to me that my
previous remarks about Cnews locking were incomplete.

doexpire and newsrun each lock against another instance of themselves
when they start up.  There's a separate mechanism that locks the
whole article filing system, by creating a file called $NEWSCTL/LOCK.
This is invoked by the expire and relaynews programs, but only during
critical periods in their execution.

I submit that it's wise to use this mechanism when hacking on the
history file by hand.  It's easier than changing the cron entry for
expire, as Henry suggested, and it locks relaynews, too.

Note that people won't be able to post articles while this lock is set.
-- 

	Chuck Karish		karish@mindcraft.com
	Mindcraft, Inc.		(415) 323-9000		

henry@zoo.toronto.edu (Henry Spencer) (12/11/90)

In article <660852555.23022@mindcraft.com> karish@mindcraft.com (Chuck Karish) writes:
>I submit that it's wise to use this mechanism when hacking on the
>history file by hand.  It's easier than changing the cron entry for
>expire, as Henry suggested, and it locks relaynews, too.

That's why I suggested using locknews, which does this.  (Note, locknews !=
newslock.)  You'd really kind of like to lock out expire too, though, hence
the suggestion about the cron entry.
-- 
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu   utzoo!henry