[news.software.b] Dynamic "smart" expiration?

dce@smsc.sony.com (David Elliott) (12/27/89)

Our site gets a pretty full newsfeed and has 250MB of space set aside
for news articles.

I've been trying to get the C news explist stuff set up for maximum
use of the space, but I am a little paranoid when it comes to raising
the expiration times because I've seen too many instances of news
logjams getting unstuck and filesystems overflowing.

I've seen programs that expire based on disk space, allowing you
to prioritize newsgroups and expire until there is a minimum amount
of free space, but this may come too late.

It seems to me that a better idea would be to have a program that
generates a list of files to remove in order of "removability".
When the unbatcher starts working, it would look at the space
available, and while there wasn't enough, it would remove files
from the top of said list.

This could even be used in concert with the current expiration
mechanism to allow for a general smooth removal of articles that
really are out of date.

The "removability" of a file would be a function of newsgroup
name, newsgroup size, and file age.  One might use a formula
like:

	removability = (X*size^2 + Y*age^2) - usefulness(newsgroup)

The "usefulness" function would be a table of constants supplied
by the administrator.  The values of X and Y are supplied for
each newsgroup to give weight to these items.

For example, consider the table entries

	# Group         Usefulness   Size   Age
	rec.music	        15      3     5
	rec		        10      1     1

This says that rec.music.* rates better than the other rec groups,
but that the newsgroup should be smaller, and the articles
become useless pretty fast.
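The formula and table above can be sketched in a few lines (a hypothetical
illustration only; the longest-prefix lookup, the default weights, and the
units for size and age are my assumptions, not part of any existing expire):

```python
# Hypothetical "removability" table: usefulness, X (size weight), Y (age weight).
TABLE = {
    "rec.music": (15, 3, 5),
    "rec":       (10, 1, 1),
}

def weights(group):
    """Longest-prefix match, so rec.music.* picks up the rec.music entry."""
    parts = group.split(".")
    while parts:
        key = ".".join(parts)
        if key in TABLE:
            return TABLE[key]
        parts.pop()
    return (0, 1, 1)  # assumed default for unlisted groups

def removability(group, size, age):
    """removability = (X*size^2 + Y*age^2) - usefulness(newsgroup)."""
    usefulness, x, y = weights(group)
    return (x * size**2 + y * age**2) - usefulness
```

With the table above, removability("rec.music.classical", 4, 3) works out to
78 while removability("rec.autos", 1, 1) is -8, so the rec.music article goes
first despite its higher usefulness, matching the intent that its articles
age quickly.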

Comments?
-- 
David Elliott
dce@smsc.sony.com | ...!{uunet,mips}!sonyusa!dce
(408)944-4073
"But Pee Wee... I don't wanna be the baby!"

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (12/28/89)

  I set my threshold quite high and leave a good bit of space when I
stop uncompressing news. On a regular basis I check the space in the
spool partition and if it is getting tight enter a loop like so:

	read the expiration (-e) time for high volume and
	low usefulness groups

	read the names of the groups in these categories

	while `not enough space' and `expiration > 0'

	  expire the high volume groups

	  expire the high noise groups (ie. alt.flame, etc)

	  decrease the expiration time by one day

	  check the space again

	# end loop
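A minimal sketch of that loop, with the helper names check_space and
expire_groups standing in for the real df(1) check and expire(1) invocation
(both are placeholders, not real interfaces):

```python
# Sketch of the emergency loop above.  check_space() should report whether
# the spool has enough room; expire_groups(groups, days) should re-expire
# the listed groups with the given retention (e.g. via expire -e).
def emergency_expire(days, groups, check_space, expire_groups):
    """Re-expire the high-volume and high-noise groups with ever-shorter
    retention until there is enough space or retention reaches zero."""
    while not check_space() and days > 0:
        expire_groups(groups, days)
        days -= 1          # decrease the expiration time by one day
    return days            # the retention we ended up at
```

For example, if each pass frees 30 blocks and 100 are needed, the loop
stops after four passes.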

  Since this doesn't happen very often, and the expire can make space
faster than news can come in (this is the trick), I have no problems.
The check usually runs about 700ms of CPU when all is normal, so I hit
it every half hour. It's inefficient when it starts working, but once or
twice a month I can stand it.
-- 
	bill davidsen - sysop *IX BBS and Public Access UNIX
davidsen@sixhub.uucp		...!uunet!crdgw1!sixhub!davidsen

"Getting old is bad, but it beats the hell out of the alternative" -anon

woods@robohack.UUCP (Greg A. Woods) (12/28/89)

In article <1989Dec27.033817.9953@smsc.sony.com> dce@smsc.Sony.COM (David Elliott) writes:
> The "removability" of a file would be a function of newsgroup
> name, newsgroup size, and file age.  One might use a formula
> like:
> 
> 	removability = (X*size^2 + Y*age^2) - usefulness(newsgroup)
> 
> The "usefulness" function would be a table of constants supplied
> by the administrator.  The values of X and Y are supplied for
> each newsgroup to give weight to these items.
> 
> For example, the table entries
> 
> 	# Group         Usefulness   Size   Age
> 	rec.music	        15      3     5
> 	rec		        10      1     1

This is similar to some ideas I had recently.  Your article has
inspired me to put my thoughts on paper, so to speak:

I would rather still have expire do the expiring, rather than rnews.
This allows more flexibility, not to mention archive support, etc.  I
would definitely not want relaynews to do expiring too!

Your usefulness field would be a factor, between 0 and MAXINT, used to
prioritize newsgroups.

The size field would be the desired number of articles to be kept in
the spool.  This number would be decremented, taking into account the
usefulness factor, if space was really tight.

Expire would still pay attention to the Expires: header, with the same
three value control field as it currently has, in place of your
suggested age field.  The 'retention' value would have highest
priority, but with the usefulness factor applied if space was really
tight.  The 'normal' value would be of lower priority than size, and
if null the Expires: header would be followed explicitly, unless the
'purge' date overrides it.  The 'purge' value would also outweigh both
usefulness and size, but could of course be left null, or set quite
large.

In addition, expire would be given a goal (of free space) to be
achieved.  (i.e. a '/freespace/' line like '/history/'.)  Expire would
still use spacefor to determine its success.

Expire would then become a multi-pass process, but I don't think this
would impair its speed much.  In order to enhance performance, I would
place the article byte size in the history file (though block size
would be more useful).  Since all cross-references are already noted
by newsgroup, it is very easy to calculate the potential gain if an
article is expired, while keeping in mind the various explist control
lines for the article.  There could even be a flag to determine the
effect on cross posted articles.  Either the quickest, or the longest,
expire could be used for all links, or each link could be expired
separately, with space gained only upon expiration of the last link.
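A rough sketch of the goal-driven part of such an expire, assuming the
article byte size is available from the history file as suggested (the data
layout and names here are mine, not C News):

```python
# Sketch of a goal-driven expire pass (structure hypothetical).  Each
# article carries a priority (usefulness/size/retention already folded in)
# and its byte size, as proposed for the history file.
def expire_to_goal(articles, free_now, goal):
    """articles: list of (priority, size_bytes); lowest priority goes first.
    Returns the articles to remove and the resulting free space."""
    removed, free = [], free_now
    for art in sorted(articles, key=lambda a: a[0]):
        if free >= goal:
            break            # the '/freespace/' goal has been met
        removed.append(art)
        free += art[1]       # gain is known from history, without a stat()
    return removed, free
```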

In case rnews runs out of space in spooling incoming news, it can
simply wait for space, as it normally does.  I currently have the
newswatcher script run hourly and it runs an emergency expire when
space becomes tight.  For now I have a series of expire scripts which
are run in sequence until sufficient space is freed.  With a goal
oriented expire, this would be unnecessary, and indeed an emergency
expire would only be required during news floods.  Of course there
must be sufficient space in your spool directory for incoming uucp
jobs while expire runs.  I am always careful to isolate
/usr/spool/news, and I usually have a separate /usr/spool/uucp, and if
not, at least a separate /usr/spool.

Also on the disk space vs. news issue, I've been thinking of changes
that would be nice in spacefor and its users in order to have finer
control in identifying space in in.coming, news spool, out.going, uucp
spool, etc.
-- 
						Greg A. Woods

woods@{robohack,gate,tmsoft,ontmoh,utgpu,gpu.utcs.Toronto.EDU,utorgpu.BITNET}
+1 416 443-1734 [h]   +1 416 595-5425 [w]   VE3-TCP   Toronto, Ontario; CANADA

dce@smsc.sony.com (David Elliott) (12/29/89)

In article <1989Dec28.063932.13720@robohack.UUCP> woods@robohack.UUCP (Greg A. Woods) writes:
>I would rather still have expire do the expiring, rather than rnews.
>This allows more flexibility, not to mention archive support, etc.  I
>would definitely not want relaynews to do expiring too!

Actually, I was thinking more in terms of having newsrun doing the
expiring as part of its loop.

The big problem as I see it is that expire is slow (at least the B
news version was), especially if you start adding special heuristics
based on usefulness and group size and file age and number of
subscribers and so forth.

If expire generated a list of files to expire once a day, you could
still archive the files, and maintain flexibility, but when it's time
for them to go to make room for other files, it's easy and fast,
and until that time comes, they're still available.

-- 
David Elliott
dce@smsc.sony.com | ...!{uunet,mips}!sonyusa!dce
(408)944-4073
"But Pee Wee... I don't wanna be the baby!"

brad@looking.on.ca (Brad Templeton) (12/29/89)

Back when I had smaller disks, I ran a spaced based expire that I wrote.

I had the cron wake up every 15 minutes and run it if the amount of free
disk space got too low.

Space based expire *is* the way to do it.   Particularly if you ever
get things like news stoppages lasting a day, or high-volume days with
big fat binaries and source distributions.

There is a certain elegance to inews doing the expire, by using a
list of 'next to go' articles that is created every night by a background
expire program.   This deals with batching well.

But the same result can probably come from having inews simply record how
much space it has used since the last check, and having an hourly program
do an expire of that much space, resetting the count.
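The counter scheme could be as simple as this sketch (persistence of the
count to a file between inews runs is omitted; the names are hypothetical):

```python
# Sketch of the counter idea: inews adds each article's size as it files
# it, and the hourly job takes the accumulated total and resets it.
class SpoolCounter:
    def __init__(self):
        self.used = 0

    def add(self, nbytes):
        """Called by inews for each article filed."""
        self.used += nbytes

    def take(self):
        """Called hourly: how much space the expire pass should recover."""
        n, self.used = self.used, 0
        return n
```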

Either way time based expire is a loser.  The purpose of expire is to
keep down the amount of disk space (and sometimes inodes) used by
news, isn't it?
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

henry@utzoo.uucp (Henry Spencer) (12/29/89)

In article <1989Dec28.171830.13130@smsc.sony.com> dce@Sony.COM (David Elliott) writes:
>>I would rather still have expire do the expiring, rather than rnews.
>>This allows more flexibility, not to mention archive support, etc.  I
>>would definitely not want relaynews to do expiring too!
>
>Actually, I was thinking more in terms of having newsrun doing the
>expiring as part of its loop.

Folks have done that with C News, although it's not something we support
officially.  Possibly we should, but the obvious technique -- dynamically
generating expire's control file and cranking down the numbers until space
is adequate -- interacts awkwardly with some of the fancier things you
can do in the control file.  If I can think of some graceful way to deal
with this, I'll probably make it available as an option.

>The big problem as I see it is that expire is slow (at least the B
>news version was), especially if you start adding special heuristics
>based on usefulness and group size and file age and number of
>subscribers and so forth.

C News expire is essentially entirely I/O-bound and dbm-bound (I haven't
yet run detailed timings with dbz, although I'll do it soon), so adding
a *little* complexity to the decision process would not be disastrous.

We were very close to adding the size of the file as another subfield
in the history file's middle field, so that it could be used as input
for decision making.  Alas, it's *not* easy to define exactly how such
policies should work in the presence of complications like per-group
expiry settings, and we tend to believe in the theory that you should
not collect data until you have some idea what you're going to do with it.

>If expire generated a list of files to expire once a day, you could
>still archive the files, and maintain flexibility, but when it's time
>for them to go to make room for other files, it's easy and fast,
>and until that time comes, they're still available.

I thought a bit about breaking expire into a decision part and an
implementation part, so to speak, like this.  I wasn't convinced that
it offered enough advantages to be worth the effort and possible
problems.  *However*... note that expire's -t option does almost exactly
what the decision module would do:  it prints a description of what
expire would do, but doesn't do it.  The output is *almost* an executable
shell file -- at one point it was one, until I noticed that there are some
complications like creating directories that are hard to deal with simply --
and picking out the file names would not be hard.  I will write up the
format in the documentation, so folks can depend on it.
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1989: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

brad@looking.on.ca (Brad Templeton) (12/29/89)

I found using the size of a file as a parameter to be useful.

My expire assigned a score to each article.  The base of the score was
the age of the file in seconds (I ignored any explicit expiry date -- I didn't
want outsiders deciding how long their pearls of wisdom would stay on my
system when I only had 3000 blocks free, thank you.)

In any group, one could add to the score based on the group, so that some
groups hung around longer than others.

In addition, I did set it so that the size of the file (multiplied by
a constant of your choice) was added to the score.

The scores were then sorted, and the files with the highest scores were
removed until enough disk space had been freed -- or rather until the
remaining disk space was my fixed allocation for news.

Adding the size meant that one really big article would go
instead of a dozen small ones.  This kept the average number of
days of articles kept higher than it would have been otherwise.
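That scoring scheme might look roughly like this (a reconstruction from the
description, not Brad's actual program; the weight constant and the sign
convention for the per-group adjustment are assumptions):

```python
# Reconstruction of the score-based expire described above.
def score(age_seconds, size_blocks, group, size_weight=100, group_adjust=None):
    """Higher score = removed sooner.  Base is file age; a per-group
    adjustment (negative to make a group hang around longer) and a size
    term are added, so one big article goes instead of a dozen small ones."""
    bonus = (group_adjust or {}).get(group, 0)
    return age_seconds + bonus + size_weight * size_blocks

def pick_victims(articles, blocks_needed):
    """articles: list of (score, blocks, path).  Remove highest scores
    until enough blocks would be freed; returns the paths to unlink."""
    freed, victims = 0, []
    for s, blocks, path in sorted(articles, reverse=True):
        if freed >= blocks_needed:
            break
        victims.append(path)
        freed += blocks
    return victims
```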

So Henry, there's one point of data.   I would post this very simple
expire here if people want it -- it's quite short -- but there's a lot
it doesn't do.  For one, it ignores the history file altogether, and
just gets ages from stat().  It doesn't update the database or the
active file.  That means you need to run expire -rebuild every few days
to get things back in mesh.   This program controls the disk space
problem and lets the real expire keep track of the databases and active
file.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/29/89)

My rnews.c that was recently posted does progressive expires if there isn't
enough space.  It works quite well.

I agree that you could come up with something that would be more efficient,
although it's not that easy to eliminate doing multiple passes through the
history file (which is all expire -r does).

-- 
Jon Zeeff    	zeeff@b-tech.ann-arbor.mi.us  or b-tech!zeeff

storm@texas.dk (Kim F. Storm) (12/30/89)

woods@robohack.UUCP (Greg A. Woods) writes:

>There could even be a flag to determine the
>effect on cross posted articles.  Either the quickest, or the longest,
>expire could be used for all links, or each link could be expired
>separately, with space gained only upon expiration of the last link.

I cannot see what benefit this would give you.

Either you expire an article because disk-space is sparse (or due to
some other resource related policy), or you keep the article.

The problem with your idea is that it makes the "Newsgroups:" line
unreliable which may fool some news readers (users and software :-)
who will say - oh, I will see this article again later in group XYZ
(so I won't read it now), or - oh, I already saw that article in group
ZYX (so I can skip it).

The only reason I can think of for doing what you suggest would be to
make the improved expire run faster compared to what it has to do if
calculating the "combined" usefulness of the article.  I really don't
believe you can save any significant time on this "hack", and I
therefore fail to see that the inconsistency that would be imposed by
this method can be justified by the marginal time savings on expire
(and those savings may be more than wasted on rewriting the history
file to reflect the narrowed set of groups in which the article occurs).

An easy rule to calculate the combined usefulness of an article would be
the maximum usefulness of the article in any of the groups to which it
is cross-posted.  This will mean that if an article is important
enough to be kept in one group, it is important enough to keep in all
its groups.
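That rule is essentially a one-liner; for example (the usefulness table is
invented for illustration):

```python
# The maximum rule above, with an invented usefulness table.
USEFULNESS = {"comp.sources.unix": 20, "alt.flame": 1}

def combined_usefulness(groups):
    """If an article is worth keeping in one group, keep it in all."""
    return max(USEFULNESS.get(g, 0) for g in groups)
```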

-- 
Kim F. Storm        storm@texas.dk        Tel +45 429 174 00
Texas Instruments, Marielundvej 46E, DK-2730 Herlev, Denmark
	  No news is good news, but nn is better!

henry@utzoo.uucp (Henry Spencer) (12/30/89)

In article <68634@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>Space based expire *is* the way to do it...
>Either way time based expire is a loser.  The purpose of expire is to
>keep down the amount of disk space (and sometimes inodes) used by
>news, isn't it?

"Keep down" does not mean "strictly bound".  Given constraints on things
like resource consumption, and a user preference for predictable behavior,
it's not obvious that time-based expire is bad.  Much depends on details
of the system's environment.
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1989: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

davecb@yunexus.UUCP (David Collier-Brown) (12/30/89)

>In article <68634@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>>Either way time based expire is a loser.  The purpose of expire is to
>>keep down the amount of disk space (and sometimes inodes) used by
>>news, isn't it?

henry@utzoo.uucp (Henry Spencer) writes:
[...]					 Given constraints on things
>like resource consumption, and a user preference for predictable behavior,
>it's not obvious that time-based expire is bad.

    To expand on the above a bit, news has never been a well-behaved user of
space, strictly because of the temporal dimension... News is trying to
present information in a timely fashion, keep it around until the majority
of readers have a chance to read (and save) it and then discard it as
"old news".  This is hard.  In the multi-site case, the delays make it
**very** hard.
    We approximate the discarding of news after use by the expiry scheme,
which is really trying to do two things:
	1) recover space (News appears to think it runs on an
	   infinite-disk-machine (:-))
	2) provide a simple rule to its client base: for example, "you must
	   read category c in one week, category s in 5 days and the rest
	   daily, or you will miss material".

    The parameterization in expire reflects the author's desire to have
the local site set the policy it needs, or not as the case may be. Excessive
concern with space (ie, an implementation problem) can cause the behavior
of the system to be mysterious and unpredictable to its users.

    Regrettably, the news flow variability tends to crash up against disk
limits a lot, making a time-based expire dangerous: with older news systems
news was simply lost when one got a burst that overflowed your disk.
[Something with which I am all too familiar].

    This leaves us with a contradiction: we have two needs, both quite
real, which draw us in opposite directions.  The reader needs the illusion
of reliability and regular expiry limits.  The system needs to trade off
space against flow.
    This tends to make an elegant solution hard.

    My best attack on the problem is to define a hierarchy of requirements,
and satisfy them in order:
     1) news shall not drop articles on the floor [0]
     2) articles shall be kept around for not less than the
	"standard" time to forward them to directly-connected
	systems, plus a safety factor [1+3]
     3) articles in groups which are NOT being read locally shall be
	available for a period of time sufficient to allow a new subscriber
	to find one or more articles in the group, so they will not mistake
	the group as inactive. [1+x, x defined by mean time between messages]
      4) articles in groups which are being read locally shall be kept for
	a period known to the readership, shall disappear soon after that 
	time and are in general unrecoverable after they disappear. [1+y]

  This implies one can usefully probe users' .newsrc files to see if groups
belong in category 3 or 4, but will have to deal in policy to make other
decisions:
     a) What groups do you send & receive, and how much space must
	you have just for transfer, plus packing and handling. (Indeed,
	must you have a separate uucp spool...).  What agreements about
	new hierarchies & groups do you have with your feeds.
     b) What groups and hierarchies do you provide locally.  (Why.)  What is
	your minimum residence time.  What minimum amount of space must you
	provide for them, if all were considered "unread".  
     c) What groups/hierarchies are currently read.  What additional space is
	required per day of residency.
     d) What is the expected increase in volume and readership per year.
	What does that do to all of the above.
     e) Do you have categories of groups (ie, comp vs talk).  What are your
	criteria for this categorization.  What will changes in category
	cost in space.

  So most of the questions are non-technical... and less than exciting to
consider.

   At the technical level (as I implied before), the best model I can suggest
is paging, with expire (the reaper!) putting the messages on the deletable
list based on as complex a set of criteria as you'd like, the news inspooler
(space user-upper) trashing them to make room for unpacked new articles,
and an optional rescuer grabbing them back if they are re-referenced later.
[This last is a gut-feel speculation on my part].

--dave (out of time to write & ideas, simultaneously) c-b
-- 
David Collier-Brown,  | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | Joyce C-B:
CANADA. 416-223-8968  |    He's so smart he's dumb.

brad@looking.on.ca (Brad Templeton) (12/31/89)

Perhaps the idea is to break up expire neatly into two parts.  One prepares
the list of articles to go according to whatever criterion, and the other
removes them, updates the database and active file etc.


The removing part could be part of inews, or an independent program.

I haven't done anything, but it seemed to me that if you want a really fancy
expire, something like a newsclip program might make an interesting front
end.  You could keep or expire or weight articles based on anything -- from
what group they're in, to who posted them, to what thread they're in to
whether they contain patterns.

Of course, it needn't be newsclip, it could be any scanning program, from
those that just read the history file to anything else you want to code.

The bad part is that this of course requires reading every article, which was
the slow thing about B news expire.   Fortunately, once you calculate the
score for an article, the only thing that affects it is the passage of time,
so you could arrange to keep a history-like file with calculated scores,
the time they were calculated, and the multiplier to be used when adding
the time since to get the final score.   In this case, you only have to
scan new articles and add to the file.


This is not enough, however.  The program that sorts and decides who is
first to go needs some smarts beyond this.  The simplest thing to do is
just keep the N blocks of articles with the highest (lowest?) scores.

But if you want to get fancy, you might want to instead assign Y blocks
per group.  ie. news.software.b always keeps 300K, talk.bizarre always
keeps 150K, etc.   You might also want to keep a fixed number of articles
in a group -- ie. keep 10 articles in groups that nobody currently reads,
or keep a minimum of 5 articles in any group, even if it's super-low volume.
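A sketch of those per-group quotas with a minimum article count (all the
numbers and names are hypothetical):

```python
# Per-group block quotas plus a floor on article count (numbers invented).
QUOTA_BLOCKS = {"news.software.b": 300, "talk.bizarre": 150}
DEFAULT_QUOTA = 200
MIN_ARTICLES = 5      # keep at least this many even in super-low-volume groups

def articles_to_keep(group, sizes):
    """sizes: newest-first list of article sizes in blocks.  Keep the
    newest articles up to the group's quota, but never drop below
    MIN_ARTICLES; returns how many articles survive."""
    quota = QUOTA_BLOCKS.get(group, DEFAULT_QUOTA)
    keep = used = 0
    for blocks in sizes:
        if used + blocks > quota and keep >= MIN_ARTICLES:
            break
        keep += 1
        used += blocks
    return keep
```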

All in all, a messy problem...
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

dce@smsc.sony.com (David Elliott) (12/31/89)

In article <6118@yunexus.UUCP> davecb@yunexus.UUCP (David Collier-Brown) writes:
>	2) provide a simple rule to its client base: for example, "you must
>	   read category c in one week, category s in 5 days and the rest
>	   daily, or you will miss material".
...
>      4) articles in groups which are being read locally shall be kept for
>	a period known to the readership, shall disappear soon after that 
>	time and are in general unrecoverable after they disappear. [1+y]

I agree with most of David's points, but I think that "will" and "shall"
in the above should be softened by adding "probably".

That is, people should know that news will be around for no less than
the given expiry date, and it might be around after, but should not be
counted on.  This is intimated in the statement "general[ly]
unrecoverable".

One thing I wonder about is the mechanism to use for grabbing the
subscriber info.  You can't rely on .newsrc being used or being
available. On our network, for example, people read news using NFS, and
they may not even have accounts on the main news machine (they only
need it to post).  Of course, we also have people who don't like the
idea of being in a network, so they read news on the main news
machine.  In other words, I don't even have a good set of rules to
follow.
-- 
David Elliott
dce@smsc.sony.com | ...!{uunet,mips}!sonyusa!dce
(408)944-4073
"But Pee Wee... I don't wanna be the baby!"

greenber@utoday.UUCP (Ross M. Greenberg) (12/31/89)

Howzabout going through everyone's .newsrc, determining what groups are not
being read by anyone and expiring them with extreme prejudice - nobody would
notice, really.  Next, expire articles that everybody has already read, 
starting with the least popular newsgroups (by frequency in .newsrc) and
heading towards the most popular last.

Only problem:  some user who consistently opts not to read a given newsgroup
but doesn't mark it as read, either.


-- 
Ross M. Greenberg, Technology Editor, UNIX Today!   greenber@utoday.UUCP
             594 Third Avenue, New York, New York, 10016
 Voice:(212)-889-6431 BIX: greenber  MCI: greenber   CIS: 72461,3212
  To subscribe, send mail to circ@utoday.UUCP with "Subject: Request"

tale@cs.rpi.edu (David C Lawrence) (12/31/89)

In article <1120@utoday.UUCP> greenber@utoday.UUCP (Ross M. Greenberg) writes:
> Only problem:  some user who consistantly opts to not read a given newsgroups
> but doesn't mark it as read, either.

That is not the only problem.  Some people like to stay unsubscribed
from groups but look in on them when they have some extra time.
Additionally, there are those times when I see mention of an article
in a group to which I am unsubscribed but it nevertheless interests
me.  Both of these scenarios are affected by the proposed expiry method.

Dave
-- 
   (setq mail '("tale@cs.rpi.edu" "tale@ai.mit.edu" "tale@rpitsmts.bitnet"))

brad@looking.on.ca (Brad Templeton) (12/31/89)

Yes, if you kill (or don't feed) groups that nobody reads, then you lose
the ability to resubscribe and have articles present.

But so what?  If saving disk space is important, then this is a small price
to pay.  It's a cute feature, but not worth megs of disk.

And if you have gigs to spare, you don't have to play around with fancy
expire tricks.

I liked Eric Raymond's idea the best.  Scan all the .newsrc files.  'and'
together the 'read' bits.  Expire those articles marked read.

Thus once everybody's read it, it's history.  Means you can't go back, but
it also means you can save a *lot* of disk space.

A slightly more relaxed scheme would 'and' all the 'read' bits and queue the
result for deletion in N days.  Articles stay N days after everybody has
read them.  N could vary by group.  In addition, roots of big trees might
stick around.
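The "'and' the 'read' bits" step might be sketched like this (a simplified
parser; real .newsrc lines also carry group names and ':'/'!' subscription
flags, which are ignored here):

```python
# AND together everyone's read articles for one group.
def parse_ranges(spec):
    """Turn a range list like '1-5,7' into a set of article numbers."""
    read = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            read.update(range(int(lo), int(hi) + 1))
        elif part:
            read.add(int(part))
    return read

def read_by_all(specs):
    """Articles every reader has seen; these are the expire candidates."""
    sets = [parse_ranges(s) for s in specs]
    return set.intersection(*sets) if sets else set()
```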

Alternately, compress and archive all articles that have been read by
everybody.  Cut news disk space by more than half.

Hard to do this with NNTP, though.  But on a machine where disk space is
important, like a PC, it's the way to go.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

woods@robohack.UUCP (Greg A. Woods) (12/31/89)

In article <432@texas.dk> storm@texas.dk (Kim F. Storm) writes:
> woods@robohack.UUCP (Greg A. Woods) writes:
> 
> >There could even be a flag to determine the
> >effect on cross posted articles.  Either the quickest, or the longest,
> >expire could be used for all links, or each link could be expired
> >separately, with space gained only upon expiration of the last link.
> 
> I cannot see what benefits this should give you?
> 
> Either you expire an article because disk-space is sparse (or due to
> some other resource related policy), or you keep the article.

And since inodes are also a resource, this is a resource conserving
option.  On many machines it would not be difficult to have a
partition which would have lots of free blocks, and no free inodes if,
for example, the average article size dropped from 3Kb to 1Kb.

I also came to like this option for some of the reasons opposite to
those you mentioned.  I read news on what is primarily a single user
site (this, my home machine).  Some groups have had a significant
amount of cross posting.  Since I might want some groups to disappear
faster than others, but not those articles which appeared in a more
interesting group, I might choose the "longest" option.  On the other
hand, the cross postings might be mostly noise, as in unix-pc to
comp.sys.att.  In this case I might want to choose the "quickest".

Since I believe that a shift of the majority of news readers to
smaller machines, with fewer fellow readers, is happening, personal
control over expiry will have many distinct advantages for an
increasing number of people.
-- 
						Greg A. Woods

woods@{robohack,gate,tmsoft,ontmoh,utgpu,gpu.utcs.Toronto.EDU,utorgpu.BITNET}
+1 416 443-1734 [h]   +1 416 595-5425 [w]   VE3-TCP   Toronto, Ontario; CANADA

smaug@eng.umd.edu (Kurt Lidl) (12/31/89)

In article <69654@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
[...]
>I liked Eric Raymond's idea the best.  Scan all the .newsrc files.  'and'
>together the 'read' bits.  Expire those articles marked read.

Tough to do with a setup like ours -- about 250 readers, based on 8
different fileservers, with one common NNTP news server.
Just getting to all the .newsrc's to read them is a small challenge.

>Thus once everybody's read it, it's history.  Means you can't go back, but
>it also means you can save a *lot* of disk space.

This also means that you cannot go back and retrieve an article that you
have passed by.

>A slightly more relaxed scheme would and all the 'read' bits and queue the
>result for deletion in N days.  Article stay N days after everybody has
>read them.  N could vary by group.  In addition, roots of big trees might
>stick around.

This is a good idea.  But I'm not sure how hard it would be to implement.

>Hard to do this with NNTP, though.  But on a machine where disk space is
>important, like a PC, it's the way to go.

This is very, very true.  I have a hard enough time just trying to get
some sort of statistics on how many read what groups in our setup.

>Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
--
/* Kurt J. Lidl (smaug@eng.umd.edu) | X Windows: Power Tools */
/* UUCP: uunet!eng.umd.edu!smaug    | for Power Fools        */

brian@radio.astro.utoronto.ca (Brian Glendenning) (01/01/90)

If you're going to do something like this you need to put the .newsrc
format in some RFC. At least one newsreader (Gnews) does not use
.newsrc files, it uses .gnewsrc files that are full of emacs lisp
stuff.

--
	  Brian Glendenning - Radio astronomy, University of Toronto
brian@radio.astro.utoronto.ca uunet!utai!radio!brian  glendenn@utorphys.bitnet

greenber@utoday.UUCP (Ross M. Greenberg) (01/01/90)

In article <`QF52&@rpi.edu> tale@cs.rpi.edu (David C Lawrence) writes:
>That is not the only problem.  Some people like to stay unsubscribed
>from groups but look in on them when they have some extra time.
>Additionally, there are those times when I see mention of an article
>in a group to which I am unsubscribed but it nevertheless interests
>me.  Both of these scenarios are affected by the proposed expiry method.
>

Although I can appreciate that, at some point an SA simply has to draw the
line and say "Not yet, sorry" when somebody wants a group.  As SA at utoday,
I recently pulled the plug on some high vol newsgroups due to disk space
considerations.  One person here complained, and I'll be reinstating that
group as soon as I get the disk space problem resolved.

Nobody said the job of SA was gonna be easy -- that's why we get the big
bucks!  :-)



-- 
Ross M. Greenberg, Technology Editor, UNIX Today!   greenber@utoday.UUCP
             594 Third Avenue, New York, New York, 10016
 Voice:(212)-889-6431 BIX: greenber  MCI: greenber   CIS: 72461,3212
  To subscribe, send mail to circ@utoday.UUCP with "Subject: Request"

frank@ladc.bull.com (Frank Mayhar) (01/01/90)

In article <1989Dec30.212935.1570@smsc.sony.com> dce@Sony.COM (David Elliott) writes:
>One thing I wonder about is the mechanism to use for grabbing the
>subscriber info.  You can't rely on .newsrc being used or being
>available. On our network, for example, people read news using NFS, and
>they may not even have accounts on the main news machine (they only
>need it to post).  Of course, we also have people who don't like the
>idea of being in a network, so they read news on the main news
>machine.  In other words, I don't even have a good set of rules to
>follow.

Well, if you're talking about doing it *right*, one approach you could take
would be to centralize the subscriber/subscription list.  That is, keep all
the .newsrc files in one place, for example in /usr/lib/news/subscribers, in
the form of "<subscriber name>.newsrc" or something.  Keep a copy of it in the
subscriber's $HOME directory, and force the two to match whenever the
subscriber runs a news reader.  For NNTP readers, the subscriber file could
be something like "machine:<login>.newsrc", with a new NNTP server command
to fetch the subscriber's .newsrc.

Certainly this would require some changes to the news readers, and to NNTP.
But I think it would be worthwhile, and not too difficult to implement.
-- 
Frank Mayhar  frank@ladc.bull.com (..!{uunet,hacgate,rdahp}!ladcgw!frank)
              Bull HN Information Systems Inc.  Los Angeles Development Center
              5250 W. Century Blvd., LA, CA  90045    Phone:  (213) 216-6241

moraes@cs.toronto.edu (Mark Moraes) (01/01/90)

Um, not everyone runs the same news configuration - on some sites, it
is impossible to find out all the .newsrc files -- our news machine is
the server for close to 100 machines in this building (maybe more -- I
haven't an easy way of telling:-) all of which NFS-mount /news.  (We
use NFS for reading news, NNTP for posting news, and a continuously
running NNTP for exchanging news:-) Many of the machines that NFS
mount the partition are in ADMINISTRATIVELY separate domains.  News
maintainers do not have root, occasionally do not have accounts on all
subscriber machines.

I don't think we want to force every newsreader to be written to have
a .newsrc either. (e.g. I have a news scanning script that uses its own
files to keep track of what news it has scanned/forwarded. Many of the
newsgroups it scans are not in my .newsrc) Or even worse, complicate
the already, er, convoluted internals of most newsreaders to provide
central subscriber lists.

For us, centralizing .newsrcs is technically hard (we prefer less
interdependency between our servers, not more) and politically
impossible.  The idea of putting any more load on our considerably
overloaded news machine would not go over well -- a lot of effort has
been put into trimming wasted CPU on that machine to keep performance
bearable.  Yes, I know, that's our problem. But I suspect we're not
alone in running news on machines that have to perform other duties
(Real Work) to earn their keep. It's much simpler to run pessimistic
time-based expires on newsgroups that we consider less than vitally
important.

	Mark
---
"It's only netnews" -- Geoff Collyer, loosely paraphrasing Peter Honeyman.

tale@cs.rpi.edu (David C Lawrence) (01/01/90)

In article <89Dec31.171430est.2251@neat.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:
> For us, centralizing .newsrcs is technically hard (we prefer less
> interdependency between our servers, not more) and politically
> impossible.
[ And other stuff about how it isn't such a hot idea at his site.] 

Indeed.  This site is quite the same way; nevertheless the model is
acceptable for many other sites on the net.  I avoided bringing up the
issue that it won't work with sites of our nature because that isn't
entirely relevant.  If a lot of people can benefit from expiry of the
nature proposed, then it is useful work in spite of the fact that it
isn't useful to us.  The mistake, of course, would be to make this the
only way expiry could be done.  I don't recall seeing anyone make such
a ludicrous suggestion as that though -- besides, the distributed
sites could always just keep what we've got now. :-)

Dave
-- 
   (setq mail '("tale@cs.rpi.edu" "tale@ai.mit.edu" "tale@rpitsmts.bitnet"))

brad@looking.on.ca (Brad Templeton) (01/01/90)

What would be really useful would be to define an extensible .newsrc format,
just as the header format is extensible.

There are many things I wanted to add to the .newsrc.  So did rn.  But you
can't; there's just the options line.  So we get rn's last and soft files
etc. and my .newsrclas file.

Let's define an extensible format, hack rn and readnews to understand it,
and then everybody can use it.

Before doing that it might be a good idea to consider if the 1-10,12,30-40
style is the best.  It is a bit cumbersome, and requires memory re-allocs
in the software.   But it is reasonably compact for a non-binary format.


Some extensions I have in mind are:
	a) RN wants to keep pointers into the active file
	b) I want to keep a 'last article filtered' counter.  (if you do
		complex filtering on a high-volume group, you want your
		filter to run in the background, and have your reader only
		show you articles your filter has processed)
	c) Various folks would like flags on groups, pointers to files or
		options associated with the group.
	d) Eric was going to put message-ids to kill in the .newsrc
	e) readnews puts its own options there
	f) There might be more options than subscribed and unsubscribed.
	   For example, filtered.

I am sure people can think of others.  Which is why you need an extensible
format.

Perhaps something simple like:

groupname[:!] [fieldname=data;]*

with fields delimited by something like colons or semicolons, and the
default field (i.e. anything starting with a digit) is the 'seen articles
list' -- thus degenerating to the current format.  Leave : and ! as the
delimiters after the group name, but add extra fields for other kinds of
subscription.
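The proposed line format is concrete enough to sketch a parser for.  The
following C is only an illustration of the "groupname[:!] [fieldname=data;]*"
idea -- the struct, buffer sizes, and function name are all invented here,
not part of rn or readnews:

```c
#include <string.h>

/* One line of the proposed extensible .newsrc:
 *   groupname[:!] [fieldname=data;]*
 * with the default field (anything starting in a digit) being the
 * familiar seen-articles list.  Everything here is a sketch. */
struct rcline {
    char group[128];
    int subscribed;     /* ':' after the group = 1, '!' = 0 */
    char seen[256];     /* default field, e.g. "1-10,12,30-40" */
    char extra[256];    /* the remaining fieldname=data pairs */
};

/* Parse one line; returns 0 on success, -1 on a malformed line. */
int parse_rcline(const char *line, struct rcline *rc)
{
    const char *delim = strpbrk(line, ":!");
    if (delim == NULL || (size_t)(delim - line) >= sizeof rc->group)
        return -1;
    memcpy(rc->group, line, (size_t)(delim - line));
    rc->group[delim - line] = '\0';
    rc->subscribed = (*delim == ':');

    rc->seen[0] = rc->extra[0] = '\0';
    const char *p = delim + 1;
    while (*p == ' ')
        p++;
    char buf[512];
    strncpy(buf, p, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, ";"); tok != NULL; tok = strtok(NULL, ";")) {
        while (*tok == ' ')
            tok++;
        if (*tok >= '0' && *tok <= '9') {
            /* Digit-initial field: the seen-articles list. */
            strncpy(rc->seen, tok, sizeof rc->seen - 1);
            rc->seen[sizeof rc->seen - 1] = '\0';
        } else if (*tok != '\0') {
            /* Any other field is kept verbatim for the reader to use. */
            if (rc->extra[0] != '\0')
                strncat(rc->extra, ";",
                        sizeof rc->extra - strlen(rc->extra) - 1);
            strncat(rc->extra, tok,
                    sizeof rc->extra - strlen(rc->extra) - 1);
        }
    }
    return 0;
}
```

A plain "news.software.b: 1-10,12" line parses with an empty extra field, so
the current format degenerates out of it just as described.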

-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

dricejb@drilex.UUCP (Craig Jackson drilex1) (01/02/90)

Might I point out that those sites with 100 machines mounting /news, or 100
machines accessing news via nntp, aren't going to benefit from newsrc-based
expiration anyway?  In a population as large as 100, the union of all
newsgroups-subscribed-to is likely to be large. Also, somebody is probably
out of town, so they're way behind in that group.

It would seem to me that newsrc-based expiration would really only be 
interesting when there are fewer than 20 readers or so.

As for the issue of NFS, NNTP, and weird newsreaders: if 'newsrc' based
expiration is desirable, it may be necessary (and useful) to mandate
an additional means for indicating subscription and 'read' status.  SMOP :-)...
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

woods@eci386.uucp (Greg A. Woods) (01/02/90)

In article <1989Dec31.083610.10649@robohack.UUCP> woods@robohack.UUCP (Greg A. Woods) writes:
> In article <432@texas.dk> storm@texas.dk (Kim F. Storm) writes:
> > woods@robohack.UUCP (Greg A. Woods) writes:
> > >Either the quickest, or the longest,
> > >expire could be used for all links, or each link could be expired
> > >separately, with space gained only upon expiration of the last link.
> > 
> > Either you expire an article because disk-space is sparse (or due to
> > some other resource related policy), or you keep the article.
> 
> And since inodes are also a resource, this is a resource conserving
> option.

Open mouth, insert foot.  OOPS.  My reasoning and deduction logic
seems to have skipped a step.  For some reason I'd forgotten that
links only require a directory entry, and as such it'll take quite a
few cross-post deletions, and a very smart filesystem, and a sudden
halt in news flow before you'll ever gain any disk space!  This was
kindly pointed out to me in mail by Brad Templeton.

The other reasons I mentioned are still relevant for those of us who
don't really like using kill files, or can't.  Kind of like having a
truly global kill file!  But I can see my argument for this option
slowly breaking down....

I would also like to point out that having newsrun do the expires is
not of much help for those of us who run rnews.immed.  By the time
newsrun gets going, it's too late.  Besides, having the input
handlers manage disk space confuses the functionality.  If you
have space problems, use newswatch to look out for such conditions and
do something about them.  That's what (I assume) it's for.
-- 
						Greg A. Woods

woods@{eci386,gate,robohack,ontmoh,tmsoft,gpu.utcs.UToronto.CA,utorgpu.BITNET}
+1-416-443-1734 [h]  +1-416-595-5425 [w]    VE3-TCP	Toronto, Ontario CANADA

amanda@mermaid.intercon.com (Amanda Walker) (01/03/90)

In article <69903@looking.on.ca>, brad@looking.on.ca (Brad Templeton) writes:
> Let's define an extensible format, hack rn and readnews to understand it,
> and then everybody can use it.

There seems to be an underlying assumption here.  Fewer and fewer people
are using rn, readnews, or even UNIX-based news readers.  Rather than
trying to infer information from an ever-muddier environment by rooting
through .{news,gnus,gnews,...}rc files, maybe a better approach would be
to make it explicit.  Once you start introducing news reading via NFS, NNTP,
PCMAIL, or whatever, the difficulty of picking up readership information
"for free" starts to skyrocket.  Horsepower-poor sites, which are where
the biggest crunches are occurring, are exactly the same sites that are
most likely to start distributing the load via the approaches above, and
thus will have the hardest time picking up readership information for
free.  Look at the arguments about Arbitron's accuracy these days...

Maybe news reading needs to become more transaction oriented; I don't
know.  Various people have done hacks to NNTP that show that it's at
least a fruitful approach, and it keeps things simple by not requiring
news *reading* software to store and maintain information required by
the database *maintenance* software (inews, expire, et al.).  Keeping
the two separate is a good thing, IMHO.  Saves headaches all around.

But, if you really want to define an extensible format, how about not
reinventing the wheel too much--something like a printable-ASCII (so you
can edit it with a text editor if necessary), easily-parsable block
structured thing, such as a printable version of ASN.1.  This way, programs
can skip over new things that they don't know about, without having to
be recompiled every time somebody adds Yet Another Flag or Option.

$.02,

Amanda Walker
InterCon Systems Corporation
--

woods@robohack.UUCP (Greg A. Woods) (01/03/90)

In article <7171@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes:
> Might I point out that those sites with 100 machines mounting /news, or 100
> machines accessing news via nntp, aren't going to benefit from newsrc-based
> expiration anyway?  In a population as large as 100, the union of all
> newsgroups-subscribed-to is likely to be large. Also, somebody is probably
> out of town, so they're way behind in that group.
> 
> It would seem to me that newsrc-based expiration would really only be 
> interesting when there are fewer than 20 readers or so.

And here, where there are fewer than 10 readers, one of them being me,
I don't want to have newsrc driven expires.  I want a space based,
goal driven, expire!  0.25 :-) I think such a beast would also be quite
useful for both large and small sites.  In looking at the references
line I'd guess that only 1/2 of the participants in this thread are on
systems running C News.  I think this shows that the problem is with
the basic idiom behind the expire control file, not with any
particular expire.

I think all of this discussion has been interesting, but it has
wandered far from what seems practical and feasible, at least in the
short term.  Once we get expire to work in the way that we (I) have
been thinking about news expiry since day one, then maybe we can think
of ways to control this new expire to suit the local culture.

Since there was such a volume of discussion on this topic, I will
assume that I'm not the only one not happy with the current state of
affairs.  Since I have also spent some time inside C News, (working on
porting, installation features, and tuning, since the alpha version),
I will think about implementing a scheme similar to what I described.
I won't guarantee I'll get anywhere, as I have several dozen projects
on the go now, but I'll try.  If anyone has any really terrific ideas
for a space based, goal driven, expire, let me know.  If anyone is
already doing this, please let me know.
-- 
						Greg A. Woods

woods@{robohack,gate,tmsoft,ontmoh,utgpu,gpu.utcs.Toronto.EDU,utorgpu.BITNET}
+1 416 443-1734 [h]   +1 416 595-5425 [w]   VE3-TCP   Toronto, Ontario; CANADA

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (01/03/90)

Given that you can't count on there being .newsrc files and you don't 
want to modify the news readers, a remaining option is to have a 
program that watches the access and modification times of the articles 
and gradually learns what groups are being read.  It's not too hard to 
determine that if an article has been around for many days and the 
access_time = mod_time then it's likely that no one is reading the 
group.  
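As a sketch of that test in C -- the seven-day threshold and the function
name are invented for illustration, not taken from any real expire:

```c
#include <sys/stat.h>
#include <time.h>

#define UNREAD_AGE_DAYS 7   /* hypothetical: how old before we judge */

/* The heuristic above: an article that has been on disk for several
 * days with its access time still equal to its modification time has
 * probably never been opened by a reader. */
int looks_unread(const struct stat *st, time_t now)
{
    double age_days = difftime(now, st->st_mtime) / (24.0 * 60.0 * 60.0);
    return age_days >= UNREAD_AGE_DAYS && st->st_atime == st->st_mtime;
}
```

An expire pass would stat() each article and tally looks_unread() per group,
concluding a group is unread only when the whole group keeps scoring that way
over a long stretch.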

-- 
Jon Zeeff    	zeeff@b-tech.ann-arbor.mi.us  or b-tech!zeeff

henry@utzoo.uucp (Henry Spencer) (01/04/90)

In article <1990Jan2.152917.15117@eci386.uucp> woods@eci386.UUCP (Greg A. Woods) writes:
>I would also like to point out that having newsrun do the expires is
>not of much help for those of us who run rnews.immed.  By the time
>newsrun gets going, it's too late...

Well, not necessarily.  If you are running with small or zero margins,
then yes, you're in trouble if you blow them even slightly... but with
substantial and well-chosen margins (notably, "articles" margin less than
"incoming" margin, so that newsrun notices trouble before rnews starts
throwing away files), it still makes sense.

>Besides, having the input
>handlers manage disk space confuses the functionality...

Disk space is one of those ugly global issues that really has to be
everybody's job.  The "right" solution is just to have enough reserve
space that nobody ever has to worry about it, but many systems don't
have that luxury.

>If you
>have space problems, use newswatch to look out for such conditions and
>do something about them.  That's what (I assume) it's for.

Actually, newswatch was motivated by the discovery that since C News
stuff is very patient about waiting for locks, a locking problem could
go unnoticed for a long time.  However, using it to keep an eye on space
problems is not unreasonable.
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1990: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

amanda@mermaid.intercon.com (Amanda Walker) (01/04/90)

In article <NVHHF|@b-tech.uucp>, zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff)
writes:
> It's not too hard to 
> determine that if an article has been around for many days and the 
> access_time = mod_time then it's likely that no one is reading the 
> group.

Now that's a nice idea.  I like it.  Since every news reader, whether local
or NNTP, has to actually read the article file at some point, this shouldn't
either break existing readers or be broken in turn by new ones.  Of course,
you only get a "read/not read" result, not a measure of how popular a group
is, but then again expiration policy should not necessarily be tied directly
to popularity.

Amanda Walker
InterCon Systems Corporation
--

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (01/04/90)

>want to modify the news readers, a remaining option is to have a 
>program that watches the access and modification times of the articles 
>and gradually learns what groups are being read.  It's not too hard to 

Before someone points this out to me, I do realize that you have to account
for other accesses to the articles (e.g. outgoing feeds).

-- 
Jon Zeeff    	zeeff@b-tech.ann-arbor.mi.us  or b-tech!zeeff

brad@looking.on.ca (Brad Templeton) (01/04/90)

Nice idea, but lots of programs run around accessing articles.  Old
expire for one.  And anybody who decides to do a search of the whole
News database.  (Which I do with newsclip programs from time to time.)
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (01/05/90)

>for a space based, goal driven, expire, let me know.  If anyone is
>already doing this, please let me know.

The rnews.c I posted does progressive expires based on disk space.  It 
keeps the disk very close to full without ever letting it fill.  It's 
for C News, so it still allows the flexible per-group specification of 
expiration times.  

The things I'd like to see are greater efficiency (a one pass system) 
and more smarts about what groups are being read.  

-- 
Jon Zeeff    	zeeff@b-tech.ann-arbor.mi.us  or b-tech!zeeff

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (01/05/90)

Re: judging news readership based on article access times

>Nice idea, but lots of programs run around accessing articles.  Old
>expire for one.  And anybody who decides to do a search of the whole
>News database.  (Which I do with newsclip programs from time to time.)

These programs (and outgoing feeds) tend to access the whole news database,
so if you are using some kind of score-based system, it wouldn't affect the
outcome.  You could reset the mod time for accesses that don't count.

It's not a very pure form of information but looking at it often and 
long enough would probably provide correct conclusions about interest 
in a group.  

-- 
Jon Zeeff    	zeeff@b-tech.ann-arbor.mi.us  or b-tech!zeeff

fmayhar@ladc.bull.com (Frank Mayhar) (01/09/90)

OK, Henry, I concede that it's effectively impossible to change *all* the
newsreaders in existence.  And it's impractical to store all .newsrc files
in one central location.  Still, it should be possible to keep a list of
subscribers, the machines that they live on, the last article they've seen
in each group, and the time they saw it.  If you do it right (e.g. by
constructing a set of library routines to maintain the stuff), you should
be able to retrofit this into existing newsreaders, and into NNTP.  Ignore
any entries that have "expired," i.e. their last access time is too long
ago.  Retrofit this into NNTP, rn, and a couple of the other most popular
readers, and run with it.  If a sysadmin has a newsreader that doesn't
support the subscription list, and he wants it, he can add it; he has the
libraries that support it.  This would solve the problem of having enough
information for a goal-driven expire, and of running arbitron in a
distributed environment.
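Concretely, a subscription entry of that sort might look like the following
-- the struct, field sizes, and line format are all invented here, not part
of any existing news system or of NNTP:

```c
#include <stdio.h>
#include <time.h>

/* One entry in a hypothetical central subscription list: who reads a
 * group, from which machine, the highest article seen, and when. */
struct sublog {
    char user[32];
    char machine[64];
    char group[128];
    long lastseen;      /* highest article number read */
    time_t when;        /* when it was last read */
};

/* Format an entry as one text line, e.g. for a file kept somewhere
 * like /usr/lib/news/subscribers.  Returns snprintf's length. */
int sublog_format(const struct sublog *s, char *buf, size_t len)
{
    return snprintf(buf, len, "%s@%s %s %ld %ld",
                    s->user, s->machine, s->group,
                    s->lastseen, (long)s->when);
}
```

A goal-driven expire or a distributed arbitron would then only need to read
these lines, skipping entries whose timestamp is too old.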

How's that?
-- 
Frank Mayhar  fmayhar@ladc.bull.com (..!{uunet,hacgate,rdahp}!ladcgw!fmayhar)
              Bull HN Information Systems Inc.  Los Angeles Development Center
              5250 W. Century Blvd., LA, CA  90045    Phone:  (213) 216-6241

wayne@dsndata.uucp (Wayne Schlitt) (01/10/90)

In article <1990Jan8.230624.8684@ladc.bull.com> fmayhar@ladc.bull.com (Frank Mayhar) writes:
> 
>                           Still, it should be possible to keep a list of
> subscribers, the machines that they live on, the last article they've seen
> in each group, and the time they saw it.  If you do it right (e.g. by
> constructing a set of library routines to maintain the stuff), you should
> be able to retrofit this into existing newsreaders, and into NNTP.
> [ ... ]


excellent idea.  of course, you should make sure your elisp library
doesn't use any feature more recent than 18.40 or so.  you wouldn't want
to cause people with old versions of emacs to have too many problems
using your routines.  :->

(yes folks, the _only_ two news readers that i have ever used have
been written in emacs lisp.  i am so happy with gnus that i doubt that
i would ever spend the time to switch to another reader, even if it
was "better"...)


-wayne

fmayhar@ladc.bull.com (Frank Mayhar) (01/11/90)

In article <WAYNE.90Jan9160339@dsndata.uucp> wayne@dsndata.uucp (Wayne Schlitt) writes:
>[sarcasm deleted]
>(yes folks, the _only_ two news readers that i have ever used have
>been written in emacs lisp.  i am so happy with gnus that i doubt that
>i would ever spend the time to switch to another reader, even if it
>was "better"...)

All this means is that gnus won't use the subscription capability right away.
When someone decides to add it, it will.  If that person is you, so much the
better.  But the capability will be there, in NNTP and (possibly) in the news 
maintenance mechanism, to support it when you're ready for it.

Just because it's not easily feasible to add the capability to *every* news
reader *immediately* is no reason to not design it and implement it in
*some* news readers.  When system administrators need it, it will be there
(in an RFC, perhaps, and in a C library), and they can add it to the readers
that they and their users use.  Over time, most of the commonly-used news
readers will pick it up.  And any new ones can have it designed into them.

The thing about subscription lists is that, once you have them, it's possible
to do other things, like restricting certain newsgroups to certain subscribers
(or not allowing certain users to subscribe to certain newsgroups), or
collecting better readership statistics, or goal-driven expires, or several
other useful things.  Certainly, the end user doesn't get very much from the
capability, but that's not the point, is it?  It's the sysadmin that needs
it.
-- 
Frank Mayhar  fmayhar@ladc.bull.com (..!{uunet,hacgate,rdahp}!ladcgw!fmayhar)
              Bull HN Information Systems Inc.  Los Angeles Development Center
              5250 W. Century Blvd., LA, CA  90045    Phone:  (213) 216-6241