[news.software.b] Auto-expiration of news

wcf@psuhcx (Bill Fenner) (11/22/88)

Does anyone have a good way to expire news automatically when the news
partition gets full?  We only have a 25 meg partition for news, and it
often manages to fill up on weekends, and it's a big pain to come in
to find a console log 5 inches thick with logs of /usr/spool/news: write
failed, filesystem full.

We're using nntp to receive articles from psuvax1 (main feed), and uucp
to send back to psuvax1 as well as to send to and from hogbbs, a FidoNet
BBS, which doesn't create many incoming messages...

  Bill

wcs@skep2.ATT.COM (Bill.Stewart.[ho95c]) (11/30/88)

In article <1066@psuhcx.psu.edu> wcf@psuhcx (Bill Fenner) writes:
> Does anyone have a good way to expire news automatically when the news
> partition gets full?  We only have a 25 meg partition for news, and it

Here's my "trashnews" script.  I run it hourly from cron, which
seems to be often enough.  It uses "df" to find how many blocks
are free, and if there aren't enough, it grinds through the
history file looking for articles to trash (starts at the top,
works down - it doesn't care when the article *should* have expired.)

Caveats:
- You need to be running a version of news with one history file.
- If your "df" output format differs from System V's, you
	will have to modify the sed / awk script to pick out
	the right field.
- it doesn't clean up the history file - expire will have to do
	this for you.  I run expire -r weekly.
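
For a df that does not print the System V layout, the free-space test
might be sketched as below.  This is a minimal sketch, not part of the
posted script: it assumes a df that accepts the POSIX -P and -k options,
and the function names free_kb and need_trash are made up for
illustration.

```shell
#!/bin/sh
# Sketch only: assumes a df that supports the POSIX -P (portable layout)
# and -k (1024-byte units) options.  Function names are hypothetical.

# Print the free space, in kilobytes, of the filesystem holding $1.
free_kb() {
	# With -P the data is on line 2:
	# filesystem, total, used, available, capacity, mount point
	df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed (exit 0) when the filesystem holding $1 has fewer than $2 KB free.
need_trash() {
	[ "$(free_kb "$1")" -lt "$2" ]
}
```

It could be called as "if need_trash /usr/spool 5000 ; then ... fi".
Note that the limit here is in kilobytes, which may not be the unit your
System V df reports in.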

========================  cut here ============================================

####### Zap netnews until disk space is adequate. 
TARGET="/usr/spool"
remove="rm -f"	## remove="echo" for debugging
debug=":"	## debug="echo"

SPOOLDIR=/usr/spool/news	## Where the articles live
LIBDIR=/usr/lib/news		## Where the data files live
trashgroups="comp/mail/maps comp/binaries/atari talk/politics/misc comp/sys/atari"
					## attack these brutally

cd $SPOOLDIR	## Where the articles live

echo "===================== `date`"
limit=5000
export limit LIBDIR remove debug
if [ "$1" = "-x" ] ; then set -x ; remove="echo remove"; debug=echo ; shift ; fi
case "$1" in
	[0-9]*)	limit=$1 ; shift ;;
	esac

######## Make sure $TARGET has inodes (evil System V bug!)
if df $TARGET | sed 's/(/ (/' |
	awk ' { { print "df inodes ", $0 ; if ( $5 < 1000 ) exit 0 ; else exit 1 ; } } ' #>/dev/null
then
	echo trashing inodes
	find comp/mail/maps -type f -print | xargs $remove
	find control talk rec/humor -type f -mtime +3 -print | xargs $remove
else echo inodes ok
fi

######## Make sure $LIBDIR has space
if df /usr | sed 's/(/ (/' |
	awk ' { { print "df,limit ", $3, '$limit' ; if ( $3 < '$limit' ) exit 0 ; else exit 1 ; } } ' #>/dev/null
then
	echo remove /usr/lib/news/ohis* /usr/lib/news/olog*
	$remove /usr/lib/news/ohis* /usr/lib/news/olog*
else echo /usr ok
fi
###############

(	## generate list of files to trash
#echo $LIBDIR/olog* $LIBDIR/ohistory 
find $trashgroups -type f -mtime +2 -print 2>/dev/null
sed -e 's/.*	//' -e '/^$/d' -e '/cancel/d' -e 's/\./\//g' $LIBDIR/history
) | while read victim victims ; do
	if [ -f "$victim" ] ; then
		if df $TARGET | sed 's/(/ (/' |
			awk ' { { print "df,limit ", $3, '$limit' ; if ( $3 < '$limit' ) exit 0 ; else exit 1 ; } } ' #>/dev/null
		then echo remove $victim $victims ; $remove $victim $victims
		else echo enough ; break
		fi
	else $debug $victim already gone
	fi
done
#################################### cut here ################
exit 0
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218 Holmdel NJ 201-949-0705 ho95c.att.com!wcs
#
#	One Bell System - it works!

dtynan@sultra.UUCP (Der Tynan) (12/01/88)

From article <1066@psuhcx.psu.edu>, by wcf@psuhcx (Bill Fenner):
> 
> Does anyone have a good way to expire news automatically when the news
> partition gets full?  We only have a 25 meg partition for news, and it
> often manages to fill up on weekends, and it's a big pain to come in
> to find a console log 5 inches thick with logs of /usr/spool/news: write
> failed, filesystem full.

I have a similar problem.  Having given it some thought, I have come up with
a clean solution that (someday) I will implement in 2.11 (or whatever).  On
the other hand, if any of the *new-and-improved* news software people are
reading this, perhaps they'd care to comment?

Anyway, the idea is this.  In the NEWS/active file, a new field is introduced
in the tradition of the 'm' field for 'moderated'.  It is a boolean ('y'/'n'?),
which indicates that the given newsgroup is not read at this site.  In this
way, a nightly (or weekly) cron program would zip through all the .newsrc
files, to see what groups aren't subscribed to, and update the 'active' file.
On the other hand, if someone subscribes to a currently unavailable group,
the daemon would reactivate it.  And vnews/readnews/whatever would inform
the reader that the group isn't currently carried, but will appear in a few
days.  Of course, certain groups (such as comp.mail.maps) would have a special
mark saying that they must ALWAYS be subscribed to ('a' perhaps?). Then, rnews
as part of its processing, would look at this flag, and if necessary, dump
the article.
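
The check rnews would make under this proposal might look roughly like
the sketch below.  It is only an illustration: the fifth column in the
active file and its 'y'/'n'/'a' values are assumptions based on the
description above, and keep_locally is a made-up name, not anything in
the news software.

```shell
#!/bin/sh
# Sketch of the proposed check, not code from any news release.
# Assumes a hypothetical fifth column in the active file:
#   y = read at this site, n = not read here, a = always keep.

# Succeed (exit 0) if an article for group $1 should be stored locally,
# according to the active file named by $2.  Groups missing from the
# active file are treated as not kept.
keep_locally() {
	awk -v group="$1" '
		$1 == group { found = 1; flag = $5 }
		END {
			if (found && flag != "n") exit 0
			exit 1
		}
	' "$2"
}
```

A caller would file the article only when
"keep_locally comp.mail.maps /usr/lib/news/active" succeeds.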

Currently, there are two ways of doing this.  One is to remove the group from
the active file, in which case the 'junk' group fills up like nobody's business.
The other is to have the sysadmin at the remote feed modify the 'sys'
file, so that certain groups aren't sent.  This is awkward, because changes
may occur very frequently.  Both schemes also mean that the 'checkgroups'
messages will bomb fairly severely.  In this age of Trailblazers, I don't
think anyone is worried about line bandwidth, but just disk space (20Mb/week),
so this scheme would allow them to carry only those groups that people actually
read.  Comments?
						- Der
-- 
	dtynan@zorba.Tynan.COM  (Dermot Tynan @ Tynan Computers)
	{apple,mips,pyramid,uunet}!Tynan.COM!dtynan

 ---  If the Law is for the People, then why do we need Lawyers? ---

nagel@beaver.ics.uci.edu (Mark Nagel) (12/01/88)

In article <2694@sultra.UUCP>, dtynan@sultra (Der Tynan) writes:
|
|Anyway, the idea is this.  In the NEWS/active file, a new field is
|introduced in the tradition of the 'm' field for 'moderated'.  It is
|a boolean ('y'/'n'?), which indicates that the given newsgroup is not
|read at this site.  In this way, a nightly (or weekly) cron program
|would zip through all the .newsrc files, to see what groups aren't
|subscribed to, and update the 'active' file.  On the other hand, if
|someone subscribes to a currently unavailable group, the daemon would
|reactivate it.  And vnews/readnews/whatever would inform the reader
|that the group isn't currently carried, but will appear in a few
|days.  Of course, certain groups (such as comp.mail.maps) would have
|a special mark saying that they must ALWAYS be subscribed to ('a'
|perhaps?). Then, rnews as part of its processing, would look at this
|flag, and if necessary, dump the article.

I'd prefer that if we are going to be adding a field to the active file, then
the field should be the last access date of each newsgroup.  This is for
these reasons:

  1. not everybody uses a news reader that uses .newsrc (e.g. vn, Gnews)
  2. many people use a distributed news system with a central
     server that makes it tough to tell who reads what.

If the field was interpreted as "avg number of requests per day", then
you get even more information so you can regulate expiration even
better.  Of course, all of this would seem to require that the active
file become a non-readable entity, with some function of inews or
whatever needed to get information from the active file so that it can
be updated correctly.  Or great cooperation from all the news readers
out there.  From this viewpoint, the cron method is superior.
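
As a rough sketch of how a last-access field could drive trimming: the
fifth column, its unit (a day number counted from some fixed starting
day), and the name stale_groups are all assumptions made for
illustration; nothing like this exists in the active file today.

```shell
#!/bin/sh
# Sketch only.  Assumes a hypothetical fifth column in the active file
# holding the day number (days since some fixed day) of the last read.

# Print the groups in active file $1 not read within $3 days of day $2.
stale_groups() {
	awk -v today="$2" -v maxage="$3" '
		NF >= 5 && today - $5 > maxage { print $1 }
	' "$1"
}
```

The groups printed would be the candidates for early expiration.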

I'm definitely in favor of anything that automatically trims groups
that are unpopular.  Any such scheme must be used carefully by
non-leaf nodes so that news can pass through to downstream feeds (I
use "must" loosely here, given the nature of the net).

Mark Nagel @ UC Irvine, Dept of Info and Comp Sci
ARPA: nagel@ics.uci.edu              | radiation: n. ... 2. smog with an
UUCP: {sdcsvax,ucbvax}!ucivax!nagel  | attitude.

amb@dasys1.UUCP (A. M. Boardman) (12/02/88)

In article <2694@sultra.UUCP> dtynan@sultra.UUCP (Der Tynan) writes:
>On the other hand, if someone subscribes to a currently unavailable group,
>the daemon would reactivate it.  And vnews/readnews/whatever would inform
>the reader that the group isn't currently carried, but will appear in a few
>days.  Of course, certain groups (such as comp.mail.maps) would have a special

Under this system, how do you get articles to propagate to the rest of the net?

Sample situation:

backbone!feedsite!site-with-only-one-or-two-feeds

If nobody on the feedsite(s) happened to subscribe to a particular newsgroup,
all the stuff posted downstream from it in that group never gets out to the
rest of the net.  Think about it.

---
            "On a scale from one to ten, that's pretty bad."
A.M.Boardman  Big Electric Cat PA Unix  ![hoptoad|phri|(uunet)]!dasys1!amb

stu@jpusa1.UUCP (Stu Heiss) (12/03/88)

In article <2694@sultra.UUCP> dtynan@sultra.UUCP (Der Tynan) writes:
-From article <1066@psuhcx.psu.edu>, by wcf@psuhcx (Bill Fenner):
-> 
-> Does anyone have a good way to expire news automatically when the news
-> partition gets full?
-I have a similar problem.  Having given it some thought, I have come up with
-a clean solution that (someday) I will implement in 2.11 (or whatever).
-Anyway, the idea is this.  In the NEWS/active file, a new field is introduced
-in the tradition of the 'm' field for 'moderated'.  It is a boolean ('y'/'n'?),
-which indicates that the given newsgroup is not read at this site.  In this
-way, a nightly (or weekly) cron program would zip through all the .newsrc
-files, to see what groups aren't subscribed to, and update the 'active' file.
-On the other hand, if someone subscribes to a currently unavailable group,
-the daemon would reactivate it.  And vnews/readnews/whatever would inform
-the reader that the group isn't currently carried, but will appear in a few
-days.  Of course, certain groups (such as comp.mail.maps) would have a special
-mark saying that they must ALWAYS be subscribed to ('a' perhaps?).

We do something similar with a couple of shell scripts and no mods to the news
software - works quite nicely.  I use the previously posted script (inactng.sh)
to get a list of inactive (nobody reads them) newsgroups and rm the articles in
the associated directories.  In addition, we always junk 'junk' and never junk
'comp.mail.maps' and 'news.announce.important'.  Run 'junkarts.sh' as often
as necessary.  In my hourly news startup and nightly expire scripts I have:

/usr/lib/news/expire.sh:find $LIB/active -newer $LIB/lastjunk -exec $LIB/junkarts.sh ';' -exec touch $LIB/lastjunk ';'
/usr/lib/news/rnews.sh:find $LIB/active -newer $LIB/lastjunk -exec $LIB/junkarts.sh ';' -exec touch $LIB/lastjunk ';'

This way if the active file gets modified junkarts is run.
Following is junkarts.sh and the utility inactng.sh.

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#	/usr/lib/news/junkarts.sh
#	/usr/lib/news/inactng.sh
# This archive created: Fri Dec  2 10:58:35 1988
# By:	stu (JPUSA - Chicago, IL)
export PATH; PATH=/bin:/usr/bin:$PATH
echo shar: "x - '/usr/lib/news/junkarts.sh'"
if test -f '/usr/lib/news/junkarts.sh'
then
	echo shar: "will not over-write existing file '/usr/lib/news/junkarts.sh'"
else
cat << \SHAR_EOF/usr/lib/news/junkarts.sh > '/usr/lib/news/junkarts.sh'
:
PATH=/bin:/usr/bin
export PATH
lib=/usr/lib/news
spool=/usr/spool/news
tmpa=/tmp/.a$$
tmpb=/tmp/.b$$
trap 'rm -f $tmpa $tmpb;exit' 0 1 2 3 15
junkalways=false

cd $spool
$lib/inactng.sh|tr . /|grep -v comp/mail/maps|grep -v news/announce/important > $tmpa
cat -s $tmpa|while read d
do
	test -d $d && ls $d|sed "s;^;$d/;"
done|while read f
do
	test -f $f && echo $f
done > $tmpb
$junkalways && find junk -type f -print >> $tmpb
xargs < $tmpb rm -f $f
SHAR_EOF/usr/lib/news/junkarts.sh
if test 452 -ne "`wc -c < '/usr/lib/news/junkarts.sh'`"
then
	echo shar: "error transmitting '/usr/lib/news/junkarts.sh'" '(should have been 452 characters)'
fi
chmod +x '/usr/lib/news/junkarts.sh'
fi
echo shar: "x - '/usr/lib/news/inactng.sh'"
if test -f '/usr/lib/news/inactng.sh'
then
	echo shar: "will not over-write existing file '/usr/lib/news/inactng.sh'"
else
cat << \SHAR_EOF/usr/lib/news/inactng.sh > '/usr/lib/news/inactng.sh'
:
ng1=/tmp/ng1$$
ng2=/tmp/ng2$$
trap 'rm -f $ng1 $ng2;exit' 0 1 2 3 15
ACTIVE=/usr/lib/news/active
cat `sed 's;[^:]*:[^:]*:[^:]*:[^:]*:[^:]*:\([^:]*\):.*;\1/.newsrc;' /etc/passwd` 2> /dev/null \
| sed -n 's/\([^:]*\):.*$/\1/p' |sort |uniq  > $ng1
sed 's/ .*//' $ACTIVE |sort > $ng2
diff $ng1 $ng2 | sed -n 's/^> //p'
SHAR_EOF/usr/lib/news/inactng.sh
if test 476 -ne "`wc -c < '/usr/lib/news/inactng.sh'`"
then
	echo shar: "error transmitting '/usr/lib/news/inactng.sh'" '(should have been 476 characters)'
fi
chmod +x '/usr/lib/news/inactng.sh'
fi
exit 0
#	End of shell archive

-- 
Stu Heiss {spl1,uchicago.edu!gargoyle,ddsw1}!jpusa1!stu

henry@utzoo.uucp (Henry Spencer) (12/03/88)

In article <2694@sultra.UUCP> dtynan@sultra.UUCP (Der Tynan) writes:
>Anyway, the idea is this.  In the NEWS/active file, a new field is introduced
>in the tradition of the 'm' field for 'moderated'.  It is a boolean ('y'/'n'?),
>which indicates that the given newsgroup is not read at this site...

If all you want to do is to expire such newsgroups quickly, it's trivial
to build a control file for C News expire instead.  This won't
stop the stuff from being filed, but it will get it off the disk quickly.
It has the advantage of no incompatible changes to file formats, and use
of existing tools rather than inventing yet another wheel.
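
A minimal sketch of building such control lines from a list of unread
groups follows.  The four-field layout (newsgroups, moderation flag,
days, archive directory) is given from memory of the C News explist
format and should be checked against expire(8); the name explist_lines
and the one-day default are assumptions.

```shell
#!/bin/sh
# Sketch only: turns a list of unread newsgroups (one per line on stdin)
# into expire control lines.  The four-field layout is a recollection of
# the C News explist format; check expire(8) before relying on it.

# $1 = days to keep articles in the unread groups (default 1).
explist_lines() {
	days=${1:-1}
	# "x" = moderated and unmoderated alike; "-" = do not archive
	awk -v days="$days" 'NF { printf "%s\tx\t%s\t-\n", $1, days }'
}
```

The output would presumably have to go ahead of any catch-all "all" line
in the control file, on the understanding that the first matching line
is the one expire uses.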

>... a nightly (or weekly) cron program would zip through all the .newsrc
>files, to see what groups aren't subscribed to, and update the 'active' file.

Note that any scheme which examines .newsrc needs to have some sort of
rule about how recent a .newsrc has to be before it is considered.
Otherwise a single user, who theoretically reads a lot of news but hasn't
been on for six months, fouls the whole thing up.
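
One simple form of such a rule is to ignore any .newsrc not modified
recently.  A minimal sketch, in which the 30-day default and the name
recent_newsrcs are illustrative assumptions:

```shell
#!/bin/sh
# Sketch only: the 30-day default and the function name are illustrative.

# Read home directories on stdin, one per line, and print each .newsrc
# among them that was modified within the last $1 days (default 30).
recent_newsrcs() {
	days=${1:-30}
	while read -r home; do
		[ -f "$home/.newsrc" ] || continue
		find "$home/.newsrc" -mtime "-$days" -print
	done
}
```

It could be fed the home-directory field of /etc/passwd, for example
"cut -d: -f6 /etc/passwd | recent_newsrcs 30".
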
-- 
SunOSish, adj:  requiring      |     Henry Spencer at U of Toronto Zoology
32-bit bug numbers.            | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

dtynan@sultra.UUCP (Der Tynan) (12/06/88)

In article <8048@dasys1.UUCP>, amb@dasys1.UUCP (A. M. Boardman) writes:
> In article <2694@sultra.UUCP> dtynan@sultra.UUCP (Der Tynan) writes:
> >On the other hand, if someone subscribes to a currently unavailable group,
> >the daemon would reactivate it.  And vnews/readnews/whatever would inform
> >the reader that the group isn't currently carried, but will appear in a few
> >days.
> Under this system, how do you get articles to propagate to the rest of the net?
>
> A.M.Boardman  Big Electric Cat PA Unix  ![hoptoad|phri|(uunet)]!dasys1!amb

I thought of that already :-)  In *no* way should the downstream sites be
'censored' by this system.  In the case of a batched newssite, inews would
continue to generate news batches for neighbours, but wouldn't save the
articles in the local news spool directory.  In the case of a non-batched link,
something similar could be done.  Since my original posting, I have gotten
some interesting feedback.  Amos Shapir suggested using a date field in the
'active' file, to record when the group was last read.  This is because
some people subscribe to certain groups, but haven't actually read them in
months.  Another possibility would be a 'count' field, for each newsgroup,
which reflected the number of people who had read the group.  Each week, then,
the cron utility would zero the counts.  Comments?
						- Der
-- 
	dtynan@zorba.Tynan.COM  (Dermot Tynan @ Tynan Computers)
	{apple,mips,pyramid,uunet}!Tynan.COM!dtynan

 ---  If the Law is for the People, then why do we need Lawyers? ---

nyssa@terminus.UUCP (The Prime Minister) (12/06/88)

In article <1988Dec2.184323.7511@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Note that any scheme which examines .newsrc needs to have some sort of
>rule about how recent a .newsrc has to be before it is considered.
>Otherwise a single user, who theoretically reads a lot of news but hasn't
>been on for six months, fouls the whole thing up.

The selection criteria for pexpire include:

.newsrc touched within a time period specified at compile time OR
within the history expiration time

The newsgroup being subscribed to

The last article read from that newsgroup being in the range of
articles on the system.

Therefore, a user who hasn't read news for 6 months won't be looked
at, unless you save the news for that long.
-- 
James C. Armstrong, Jr		nyssa@terminus.UUCP