[net.news.b] expire takes 73 minutes of cpu?!?!?

reid@Cascade.ARPA (11/02/84)

We run 2.10.2 news, essentially as distributed in net.sources.
Lacking any particular documentation on expire except the out-of-date man
page that came with 4.2BSD, and reading through the source for expire
enough to see that it is not obvious, I have been doing my expiring in the
following way:

% grep expire /usr/adm/daily.sh
/usr/lib/news/expire -n arpa.unix-wizards -e30 -a
/usr/lib/news/expire -n net.sources  -e15 -a
/usr/lib/news/expire -e30 -n all
/usr/lib/news/expire -n net.singles -n net.flame -n net.politics -n net.religion  -e10

This seems to more or less work, though it has left some very strange things
in my history files from time to time. My complaint is that it takes 3 hours
of wall clock time and 73 minutes of CPU time on an idle 750 to run these 4
expire commands:

% lastcomm expire | head -8
expire     S     news     __        762 secs Fri Oct 26 07:40
expire     S     news     __       2258 secs Fri Oct 26 06:38
expire     S     news     __        681 secs Fri Oct 26 05:54
expire     S     news     __        720 secs Fri Oct 26 04:55
expire     S     news     __        536 secs Thu Oct 25 06:58
expire     S     news     __       2454 secs Thu Oct 25 05:29
expire     S     news     __        253 secs Thu Oct 25 05:13
expire     S     news     __        294 secs Thu Oct 25 04:54

Does everybody's expire take this long? If not, what am I doing wrong?
If so, does anybody but me think this is too much?

chuqui@nsc.UUCP (Zonker T. Chuqui) (11/05/84)

In article <1069@Cascade.ARPA> reid@Cascade.ARPA writes:
>We run 2.10.2 news, essentially as distributed in net.sources.
>
>% grep expire /usr/adm/daily.sh
>/usr/lib/news/expire -n arpa.unix-wizards -e30 -a
>/usr/lib/news/expire -n net.sources  -e15 -a
>/usr/lib/news/expire -e30 -n all
>/usr/lib/news/expire -n net.singles -n net.flame -n net.politics -n net.religion  -e10

A more efficient way of doing this for 2.10.2 would be:

/usr/lib/news/expire -a arpa.unix-wizards -e30
/usr/lib/news/expire -a net.sources -e15
/usr/lib/news/expire -n net.singles net.flame net.politics net.religion -e10

One change to expire is that the -a flag now accepts arguments, so the
first expire will do the work of both the original first and third. It
will expire everything AND archive only arpa.unix-wizards. You have to be
rather familiar with the expire source to figure this out; the code for it
isn't obvious. In previous versions of expire, the -a flag was all or
nothing.

>This seems to more or less work, though it has left some very strange things
>in my history files from time to time. My complaint is that it takes 3 hours
>of wall clock time and 73 minutes of CPU time on an idle 750 to run these 4
>expire commands:
>
>Does everybody's expire take this long? If not, what am I doing wrong?
>If so, does anybody but me think this is too much?

Expire is, to put it nicely, a hog. Your figures aren't out of line with
what you are asking it to do. Cutting out that fourth expire will help, and
if you can keep net.singles et al. for 15 days instead of 10 this MIGHT
(untested! untested!) work:

	/usr/lib/news/expire -e15 -a net.sources -n net.singles net.flame net.politics net.religion


chuq
-- 
From the Department of Bistromatics:                   Chuq Von Rospach
{cbosgd,decwrl,fortune,hplabs,ihnp4,seismo}!nsc!chuqui  nsc!chuqui@decwrl.ARPA

  I'd know those eyes from a million years away....

geoff@desint.UUCP (Geoff Kuenning) (11/10/84)

From chuqui@nsc.UUCP:

>Expire is, to put it nicely, a hog.

The current expire opens up every article file to look for an "Expires:" line
in the header.  To find out how much this costs (approximately), I did:

	cd /usr/spool/news
	time find . -type f -print | xargs cat >/dev/null

(In retrospect head -5 would have been more accurate, but it's not off by too
much).  It ran for 45 minutes before I had to abort it, and produced exactly
the same seeking pattern as expire.  My normal expires run somewhere from an
hour to 1:15 when there is no other disk activity, and eat essentially 100%
of the seek time on the drive.  
 
The obvious solution is to put the expiration date in the history file.  This
is a bit beyond my current free-time level.  So I was wondering about doing
a shell script something like this:

	break up /usr/lib/news/history into article pathnames
	sort the list, and 'comm' it against yesterday's list to get a list of
		newly-arrived articles
	Append their pathnames and expiration dates to a file called
		/usr/lib/news/expdates

From here, it is fairly easy to see how to expire without opening lots of
files.  Only, when I plot it out a bit more, it becomes obvious that you need
to write at least one program that calls getdate.y to crack the dates and
has the smarts to expire based on newsgroup, Expires line, and Date-Received
and such lines.  So that doesn't seem like much of an approach, either.

Can anybody out there come up with a quick hack to cut down on these multiple
opens?  Or does somebody maybe have the time to do it right?
-- 

	Geoff Kuenning
	First Systems Corporation
	...!ihnp4!trwrb!desint!geoff

henry@utzoo.UUCP (Henry Spencer) (11/13/84)

We've got inews modified to put the expiry date (if any) in the history
file, and expire modified to look at it.  But this inews/expire pair are
based on 2.10 (not even 2.10.1) and I haven't had a chance to compare it
to 2.10.2 yet to decide whether the changes are compatible.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

chuqui@nsc.UUCP (Cheshire Chuqui) (11/17/84)

In article <206@desint.UUCP> geoff@desint.UUCP (Geoff Kuenning) writes:
>
>The current expire opens up every article file to look for an "Expires:" line
>in the header.
>
>My normal expires run somewhere from an
>hour to 1:15 when there is no other disk activity, and eat essentially 100%
>of the seek time on the drive.  
> 
>The obvious solution is to put the expiration date in the history file.

Unfortunately, it isn't quite so obvious. Expire has the '-e' flag that 
changes the time to expire, plus the '-i' and '-I' flags that cause it to
ignore 'Expires:' header lines. Use of any of these flags makes generating
expiration dates for the history file at reception time impossible. If you
are willing to use only the default cases of expire, it would help, but I
have yet to see a news site that does that. For example, I set DFLTEXP (the
defs.h value for when to get rid of things) rather high, usually 30 days or
so, and then use a series of expires with the '-e' option to keep the data
base to a specific size depending on how much disk space I've got allocated
to it. 

What might be useful would be to add code so that inews flags articles with
'Expires:' lines to some file, say in the form
'<article_id> <expiration_date>'. If we do that, expire can use the
existing date in the history file as the basis for the expiration and
reference the expiration date from this other file if necessary. It might
also be possible to simply flag articles that have an 'Expires:' line in the
history file and get expire to look only at them, saving us another file at
the expense of a change to the history file format.

If we DO change the history file format, what does this break? Anyone out
there have any ideas?

chuq
-- 
From the Department of Bistromatics:                   Chuq Von Rospach
{cbosgd,decwrl,fortune,hplabs,ihnp4,seismo}!nsc!chuqui  nsc!chuqui@decwrl.ARPA

  This plane is equipped with 4 emergency exits, at the front and back of
  the plane and two above the wings. Please note that the plane will be
  travelling at an average altitude of 31,000 feet, so any use of these
  exits in an emergency situation will most likely be futile.

lepreau@utah-cs.UUCP (Jay Lepreau) (11/19/84)

Chuq claims that the use of "expire -e N" negates the value of putting
the expiration date in the history file.  Huh??  I'm no expert on the
news software, but that makes no sense to me.  The object is to avoid
opening every news article to find the poster-specified expiration
date.  Putting that date in the history file doesn't change one bit the
algorithm for determining the actual expiration date.

Jay Lepreau

henry@utzoo.UUCP (Henry Spencer) (11/20/84)

> >The obvious solution is to put the expiration date in the history file.
> 
> Unfortunately, it isn't quite so obvious. Expire has the '-e' flag that 
> changes the time to expire, plus the '-i' and '-I' flags that cause it to
> ignore 'Expires:' header lines. Use of any of these flags makes generating
> expiration dates for the history file at reception time impossible. ...

It's still dead simple.  If the file came in with no explicit expiry date,
you simply record it as such in the history file.  (The way we do it is
to give the expiry date as "-" in this case.)  The history file is
already recording the arrival date, which (around here, at least) is the
other date that expire needs to look at.  Expire can make its decisions
exactly the same way it has in the past, but rather more quickly.
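
As a rough sketch of what reading such a history line could look like: the
post does not show the modified format, so the field layout here (message ID,
arrival time, then an expiry time or "-") and the use of plain seconds since
the epoch are assumptions made for illustration. The names hist_line and
parse_hist_line are hypothetical and do not come from the news source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one history line. */
struct hist_line {
    char id[128];     /* message ID, e.g. "<123@utzoo.UUCP>" */
    long arrived;     /* arrival time, seconds since the epoch (assumed) */
    int  has_expiry;  /* 0 if the expiry field was "-" */
    long expiry;      /* explicit expiry time, valid only if has_expiry */
};

/*
 * Parse "<id> <arrived> <expiry-or-dash>" (assumed layout).
 * Returns 0 on success, -1 if the line does not have that shape.
 */
int parse_hist_line(const char *line, struct hist_line *out)
{
    char expbuf[32];
    char *end;

    assert(line != NULL && out != NULL);

    if (sscanf(line, "%127s %ld %31s", out->id, &out->arrived, expbuf) != 3)
        return -1;

    if (strcmp(expbuf, "-") == 0) {
        /* No explicit expiry date came with the article. */
        out->has_expiry = 0;
        out->expiry = 0;
        return 0;
    }

    /* Otherwise the field must be a complete decimal number. */
    out->expiry = strtol(expbuf, &end, 10);
    if (end == expbuf || *end != '\0')
        return -1;
    out->has_expiry = 1;
    return 0;
}
```

With something like this, expire gets both dates from one sequential read of
the history file and never has to open an article to learn whether it carries
an explicit expiry date.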

> If we DO change the history file format, what does this break? Anyone out
> there have any ideas?

It broke practically nothing here; I had to make very small adjustments
in one or two places.

chuqui@nsc.UUCP (Cheshire Chuqui) (11/20/84)

In article <3114@utah-cs.UUCP> lepreau@utah-cs.UUCP (Jay Lepreau) writes:
>Chuq claims that the use of "expire -e N" negates the value of putting
>the expiration date in the history file.  Huh??  I'm no expert on the
>news software, but that makes no sense to me.  The object is to avoid
>opening every news article to find the poster-specified expiration
>date.  Putting that date in the history file doesn't change one bit the
>algorithm for determining the actual expiration date.
>

Here is the situation. Assume we put the expiration date in the history
file. When a message comes in, we add the DFLTEXP value (say 14 days) to
the time received to get the expiration time (November 1 becomes November
15 for expiration). If you then run expire with the -e option to expire
something at, say, 10 days, that article should really expire November 11.
The stored expiration time becomes useless because the system would still
have to read the article and add the -e value to its receipt date to find
out if it should be expired.

A better alternative is to make sure the date in the history file is the
date received (in an easily usable format), add a flag if there is an
Expires: header line (or better yet, do away with it completely), and base
expiration on the date in the history file + DFLTEXP or the -e value,
opening up the article only to get Expires: lines. Come to think of it,
that shouldn't be too much trouble to implement. Me and my copious free
time.....

chuq



phil@amdcad.UUCP (Phil Ngai) (11/20/84)

I bet Jay thought the date put in the history file would be
derived from the Expires: line or the date of receipt if there
was no Expires: line. If you want even more flexibility, you could
also store the date the article was received.

> In article <3114@utah-cs.UUCP> lepreau@utah-cs.UUCP (Jay Lepreau) writes:
> >Chuq claims that the use of "expire -e N" negates the value of putting
> >the expiration date in the history file.  Huh??  I'm no expert on the
> >news software, but that makes no sense to me.
> 
> Here is the situation. Assume we put the expiration date in the history
> file. When a message comes in, we add the DFLTEXP value (say 14 days) to
> the time received to get the expiration time (November 1 becomes November
> 15 for expiration).
-- 
 When you've seen one tree, you've seen them all. -Bonzo Reagan

 Phil Ngai (408) 749-5790
 UUCP: {ucbvax,decwrl,ihnp4,allegra,intelca}!amd!phil
 ARPA: amdcad!phil@decwrl.ARPA

gnu@sun.uucp (John Gilmore) (11/22/84)

The problem here is that there are two algorithms for expiration date:

(1) messages without Expires:
	take receipt date, add "-e" value or DFLTEXP if none was used.

(2) messages with Expires:
	use the specified date

So, it sounds to me like the info you need in the history file is:

(A)	Was Expires: specified?
(B)     if A==no,  the receipt date
	if A==yes, the specified expiration date

Then you'll never need to look in the message to figure out either case.
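
A minimal sketch of that decision, assuming the history file has already
been parsed into the two items (A) and (B). The struct and function names are
hypothetical, and keep_days stands in for the "-e" value, or for DFLTEXP when
no "-e" was given. This only illustrates the two cases above; it is not the
code in expire.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define SECS_PER_DAY (24L * 60L * 60L)

/* Hypothetical history record holding items (A) and (B). */
struct hist_entry {
    bool   has_expires; /* (A) was Expires: specified? */
    time_t when;        /* (B) Expires: date if (A), else receipt date */
};

/*
 * Return true if the article should be removed at time `now`.
 * keep_days is the -e value, or DFLTEXP if -e was not used (assumption).
 */
bool should_expire(const struct hist_entry *h, long keep_days, time_t now)
{
    time_t deadline;

    assert(h != NULL && keep_days >= 0);

    if (h->has_expires)
        deadline = h->when;                 /* case (2): use specified date */
    else
        deadline = h->when + (time_t)(keep_days * SECS_PER_DAY); /* case (1) */

    return now >= deadline;
}
```

Because the receipt date is stored untouched in case (1), the same record
gives the right answer for any "-e" value chosen at expire time.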

rick@seismo.UUCP (Rick Adams) (11/24/84)

The following cheap fix will greatly speed up the case where
several expires are run consecutively. It will only 
do any good if you are using the -n option.

A much faster expire (roughly 3 times) will be part of 2.10.3 in about
a month or so.
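
The idea behind the fix that follows can be sketched in isolation: a group
that does not match the -n pattern is passed over before any of its articles
are touched. The matcher below is a deliberately simplified stand-in written
for this illustration (it treats "all" as a wildcard and otherwise does a
dot-delimited prefix match); it is not the real ngmatch(), which handles
more than this. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for a newsgroup matcher (assumption, not ngmatch()):
 * "all" matches every group; otherwise pat must equal the group name or
 * be a prefix of it that ends at a '.' boundary.
 */
int simple_ngmatch(const char *group, const char *pat)
{
    size_t n;

    assert(group != NULL && pat != NULL);

    if (strcmp(pat, "all") == 0)
        return 1;
    n = strlen(pat);
    if (strncmp(group, pat, n) != 0)
        return 0;
    return group[n] == '\0' || group[n] == '.';
}

/* Count the groups whose articles would actually need to be examined. */
int groups_to_scan(const char *const groups[], int ngroups, const char *pat)
{
    int i, count = 0;

    for (i = 0; i < ngroups; i++) {
        if (!simple_ngmatch(groups[i], pat))
            continue;   /* skipped early, as in the patch */
        count++;
    }
    return count;
}
```

With a narrow pattern most groups fall into the skipped branch, which is why
the saving only shows up when -n is in use.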

---rick

*** expire.c	Fri Nov 23 17:12:54 1984
--- expire.c.new	Fri Nov 23 17:18:10 1984
***************
*** 515,520
  		if (sscanf(afline,"%s %ld %ld %c",nbuf,&maxart, &minart,
  		    &cansub) < 4)
  			xerror("Active file corrupt");
  		minart = maxart > 0 ? maxart : 1L;
  		/* Change a group name from
  		   a.b.c to a/b/c */

--- 515,525 -----
  		if (sscanf(afline,"%s %ld %ld %c",nbuf,&maxart, &minart,
  		    &cansub) < 4)
  			xerror("Active file corrupt");
+ 		if (!ngmatch(nbuf,ngpat)) {
+ 			fputs(afline, nhfd);
+ 			continue;
+ 		}
+ 			
  		minart = maxart > 0 ? maxart : 1L;
  		/* Change a group name from
  		   a.b.c to a/b/c */