[news.software.b] Steady state news.

flee@dictionopolis.cs.psu.edu (Felix Lee) (12/12/90)

News volume is something like 15 megabytes a day (and growing).
expire needs about 10M to rewrite the history file.  Disk space used
by the news system varies by 25M or more on a daily basis.

Why not expire constantly?  Every time you receive some articles,
remove some other articles and free an equivalent amount of space.
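
In the simplest form the bookkeeping is just "free at least as many bytes
as you took in".  A rough sketch in C (next_victim() is a stand-in for
whatever policy picks the next article to go, and is the hard part):

    /*
     * Steady-state sketch: whenever bytes_needed bytes of articles
     * arrive, unlink previously chosen victims until at least that
     * much space has been recovered.
     */
    #include <sys/stat.h>
    #include <unistd.h>

    extern const char *next_victim(void);   /* hypothetical policy hook */

    void
    free_space(long bytes_needed)
    {
        long freed = 0;
        const char *path;
        struct stat st;

        while (freed < bytes_needed && (path = next_victim()) != NULL) {
            if (stat(path, &st) == 0 && unlink(path) == 0)
                freed += st.st_size;
        }
    }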

In a steady-state news system, disk space usage is easier to control
and requires less care and feeding.

This is an extreme form of space-based expiry and inherits all of its
control problems.  Deciding which articles to expire may be difficult.
Has anyone figured out how space-based expiry should work?
--
Felix Lee	flee@cs.psu.edu

henry@zoo.toronto.edu (Henry Spencer) (12/12/90)

In article <Faivp#r3@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>This is an extreme form of space-based expiry and inherits all of its
>control problems.  Deciding which articles to expire may be difficult.
>Has anyone figured out how space-based expiry should work?

We thought about it a bit for C News, and concluded that the policy issues
are complicated (it's simple if you expire everything at the same time,
but the interactions with selective expiry get messy) and we didn't feel
like solving them.
-- 
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu   utzoo!henry

scs@lokkur.dexter.mi.us (Steve Simmons) (12/12/90)

flee@dictionopolis.cs.psu.edu (Felix Lee) writes:

>News volume is something like 15 megabytes a day (and growing).
>expire needs about 10M to rewrite the history file.  Disk space used
>by the news system varies by 25M or more on a daily basis.

>Why not expire constantly?  Every time you receive some articles,
>remove some other articles and free an equivalent amount of space.

We've recently gone to twice-daily expires.  Since Cnews expire
runs so quickly, it's not been a significant load on the systems and
has done wonders to level out disk usage in /usr/spool/news.  We
expire after 4 days on average, so with two runs a day the spool only
swings by about half a day's news instead of a full day's; that cuts
the peaks by about 12%.  Definitely worth it.

We've been keeping about 12 months of stats on how space is distributed
in /usr/spool/news.  Most of the data wouldn't mean much to another site
unless it used *exactly* our expire pattern, but I was quite surprised
to see how much variance there was from the start of a week to the end,
and, longer term, (apparently) with the school year.  If I ever come up
with copious free time I'll try to work those figures up into something
rational.
-- 
"SO be it!  The fate of the UNIVERSE is in your hands!"
"Talk about job-related stress."

jef@well.sf.ca.us (Jef Poskanzer) (12/12/90)

In the referenced message, henry@zoo.toronto.edu (Henry Spencer) wrote:
}it's simple if you expire everything at the same time,

I've thought about doing this too, in my Copious Free Time(tm).
Seems like it would be a good thing to add to the C news
package as an option.
---
Jef

  Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
"Necessity is the plea for every infringement of human freedom.  It is
 the argument of tyrants; it is the creed of slaves." -- William Pitt

gary@proa.sv.dg.com (Gary Bridgewater) (12/12/90)

I really like this idea!

In article <1990Dec11.231124.24426@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <Faivp#r3@cs.psu.edu> flee@dictionopolis.cs.psu.edu (Felix Lee) writes:
>>This is an extreme form of space-based expiry and inherits all of its
>>control problems.  Deciding which articles to expire may be difficult.

Keep a FIFO file per expire rule - i.e. a 1day file, 2day file, 1week file.
Maintaining the files is a pain, but you could keep a separate tell() index
to the current actual beginning and only rebuild from time to time.
New articles just go on the end via append.
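
A sketch of the mechanics for one such FIFO file, in C (file names and
record format are invented; each expire class, say 1day or 1week, would get
its own file plus a small ".off" companion recording where the live part
currently begins):

    /* Append one entry (message-ID, arrival time, size) to a FIFO
     * file such as "fifo.1week". */
    #include <stdio.h>
    #include <time.h>

    int
    fifo_append(const char *fifo, const char *msgid, long size)
    {
        FILE *fp = fopen(fifo, "a");

        if (fp == NULL)
            return -1;
        fprintf(fp, "%s\t%ld\t%ld\n", msgid, (long)time(NULL), size);
        return fclose(fp);
    }

    /* Remember how far into the FIFO we have already expired, so the
     * big file itself only needs rewriting once in a while. */
    int
    fifo_mark(const char *fifo, long offset)
    {
        char offname[1024];
        FILE *fp;

        sprintf(offname, "%s.off", fifo);
        if ((fp = fopen(offname, "w")) == NULL)
            return -1;
        fprintf(fp, "%ld\n", offset);
        return fclose(fp);
    }

The expire side would fseek() to the saved offset, unlink articles until it
has freed enough, and write the new ftell() position back with fifo_mark().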

This assumes that the current history file is split in two - one
file to hold the IDs and another (set of) file(s) that contains the
posted/expire/article# info.  It might be handy to keep the size as well.
This data is all jammed together now since it is processed at the same time.
The problem is, or has been, coordinating them - i.e. knowing when to drop
articles and when IDs can be let go.  But I don't think they need to be
coordinated much beyond keeping an ID _at least_ as long as you keep the
article.  It should be possible to devise a method to drop IDs separately
from the date - perhaps an 8 bit pseudo-time stamp.  That is, e.g., this is
week 0, so we scan the database and drop all the old week-1 values; next
week is week 1 and we scan and drop all the week-2 values (the slot just
ahead of the current week always holds the oldest surviving entries).
After week 0xff we just wrap.  I don't know where the 8 bits are going to
come from - maybe 4 or 6 is enough.  It's just a thought.
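
In code the wrap test is tiny (made-up names; "stamp" is the per-entry
pseudo-time stamp, "now" the current week number mod 256):

    /* An entry is stale when its stamp sits in the slot just ahead of
     * the current week, i.e. it is 255 weeks old. */
    int
    stale(unsigned int stamp, unsigned int now)
    {
        return stamp == ((now + 1) & 0xff);
    }

The weekly scan just walks the ID database, deletes every entry for which
stale() is true, and then bumps the week counter.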

>>Has anyone figured out how space-based expiry should work?

This should be answered on a per-site basis.  You know where you store
news and how to check the current utilization and limits.  The
software would invoke your function and could ask either "How much
should we dump?" or "Should we dump more?" or some similarly simple
question whose answer you can determine.  When you have built the module
you "compile" the expire system - i.e. run the script-producing script.
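
For instance, the site module might boil down to one function like this
(a sketch only: statvfs(), or whatever the local equivalent is, does the
disk query here, and the spool path and the 10%-free target are site
choices):

    #include <sys/statvfs.h>

    #define SPOOL     "/usr/spool/news"
    #define MIN_FREE  10          /* keep at least 10% of the disk free */

    /* Answer "how many bytes should we dump?" for the expire system. */
    long
    bytes_to_dump(void)
    {
        struct statvfs sv;
        double total, avail, want;

        if (statvfs(SPOOL, &sv) != 0)
            return 0;                  /* can't tell; dump nothing */
        total = (double)sv.f_blocks * sv.f_frsize;
        avail = (double)sv.f_bavail * sv.f_frsize;
        want  = total * MIN_FREE / 100.0 - avail;
        return want > 0 ? (long)want : 0;
    }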


>We thought about it a bit for C News, and concluded that the policy issues
>are complicated (it's simple if you expire everything at the same time,
>but the interactions with selective expiry get messy) and we didn't feel
>like solving them.

Don't apologize for it.  Breaking up the functionality of posting and
expiring was a _Good_ _Thing_.  Thanks.  It opens up a lot of possibilities
and lets effort be spent on individual components without destroying
your whole news system.  Hacking B-news expire can be very scary.


Granted that adding all this file manipulation is a pain - dealing with
15 Mbytes a day is already a pain and it isn't going to get better.
-- 
Gary Bridgewater, Data General Corporation, Sunnyvale California
gary@sv.dg.com or {amdahl,aeras,amdcad}!dgcad!gary
C++ - it's the right thing to do.

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/12/90)

>>Why not expire constantly?  Every time you receive some articles,
>>remove some other articles and free an equivalent amount of space.
>
>We've recently gone to twice-daily expires.  Since Cnews expire
>runs so quick, it's not been a significant load on the systems and
>had done wonders to level out disk usage in /usr/spool/news.  We

I went to automatic, on-demand expires (run from rnews) a long time
ago.  Much better than any approach based on guessing how much disk
space will be needed.  I agree that a smarter continuous expire could
be even more efficient (and more difficult to implement).

Sources are available via anon ftp from ais.org:~ftp/pub/cnews.speedups.Z


-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

brad@looking.on.ca (Brad Templeton) (12/13/90)

Yes, I have long thought that expire-as-it-arrives is the best solution.

You don't have to be stupid about it either.

For example, you could have a process run nightly (or less than that, as
need be) which prepares a list of articles on the system, sorted by their
"value".

The value is up to you, but it would no doubt be a function of their age,
newsgroup, and author/site (ie. keep local articles longer), modified by
things like explicit expiry, etc.  You can, of course, get age, explicit
expiry, newsgroup and posting site from the current C news history file.

You need one other thing, namely size.  (My space-based expire used to
factor size in as well, reducing the value of big files, so that they
went slightly earlier -- the theory being it was better to toss one 40K
article than 20 2K articles.  This might not apply in source groups.)

Anyway, you sort the articles based on their value, and you thus produce
a list, from the least valuable upwards, with the message-id and size in
disk blocks.

(You might add more to make it go faster on subsequent nights, since you
could, in theory, calculate the value only for new articles, figuring the
value of already-known articles from their old value and the elapsed time.)

On the other hand, you only need to store the least valuable N megabytes of
articles.
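
A sketch of the nightly pass, with a completely made-up weighting (older
and bigger articles are worth less; per-group preferences, local authors
and explicit expiry dates would all be folded into the weight knob):

    #include <stdio.h>
    #include <stdlib.h>

    struct art {
        char   msgid[256];
        long   blocks;       /* size in disk blocks */
        double age_days;
        double weight;       /* group/author/expiry knob, 1.0 = normal */
        double value;
    };

    /* Invented formula: value decays with age; big articles lose a bit. */
    static double
    article_value(const struct art *a)
    {
        return a->weight / ((1.0 + a->age_days) * (1.0 + a->blocks / 40.0));
    }

    static int
    by_value(const void *p, const void *q)
    {
        const struct art *a = p, *b = q;

        return (a->value > b->value) - (a->value < b->value);
    }

    /* Sort least valuable first and write "message-id size" lines. */
    void
    write_victim_list(struct art *arts, int n, FILE *out)
    {
        int i;

        for (i = 0; i < n; i++)
            arts[i].value = article_value(&arts[i]);
        qsort(arts, n, sizeof(struct art), by_value);
        for (i = 0; i < n; i++)
            fprintf(out, "%s %ld\n", arts[i].msgid, arts[i].blocks);
    }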

When your news database program (relaynews) comes along, it counts the
disk space it uses.   As it uses space, it goes through the expire file
and frees up enough files to get back that space -- keeping track of
extra stuff freed in some sequence file.   (It needs to store a seek
address into the value file too.)
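
Again only a sketch: msgid_to_path() stands in for a history lookup, and
the ".seek" file plays the role of the sequence/seek file mentioned above.

    #include <stdio.h>
    #include <unistd.h>

    extern char *msgid_to_path(const char *msgid);   /* hypothetical */

    /* After storing used_blocks of new articles, remove victims from
     * the sorted list until that many blocks are back. */
    void
    reclaim(const char *victims, long used_blocks)
    {
        FILE *fp = fopen(victims, "r");
        FILE *sk;
        char msgid[256], seekfile[1024], *path;
        long blocks, freed = 0, pos = 0;

        if (fp == NULL)
            return;
        sprintf(seekfile, "%s.seek", victims);
        if ((sk = fopen(seekfile, "r")) != NULL) {
            fscanf(sk, "%ld", &pos);
            fclose(sk);
        }
        fseek(fp, pos, SEEK_SET);
        while (freed < used_blocks &&
               fscanf(fp, "%255s %ld", msgid, &blocks) == 2) {
            if ((path = msgid_to_path(msgid)) != NULL && unlink(path) == 0)
                freed += blocks;
        }
        pos = ftell(fp);
        fclose(fp);
        if ((sk = fopen(seekfile, "w")) != NULL) {
            fprintf(sk, "%ld\n", pos);
            fclose(sk);
        }
    }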

And thus news takes exactly a fixed amount of space, but with some groups
turning over faster than others, etc.

Since the purpose of expire is to keep news to a limited amount of disk
space, and not a limited number of days, this seems to me the ideal way
to do an expire.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

clear@cavebbs.gen.nz (Charlie Lear) (12/13/90)

In article <1990Dec12.093657.1488@proa.sv.dg.com> gary@proa.sv.dg.com (Gary Bridgewater) writes:
>I really like this idea!
>In article <1990Dec11.231124.24426@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>>We thought about it a bit for C News, and concluded that the policy issues
>>are complicated (it's simple if you expire everything at the same time,
>>but the interactions with selective expiry get messy) and we didn't feel
>>like solving them.
>
>Granted that adding all this file manipulation is a pain - dealing with
>15 Mbytes a day is already a pain and it isn't going to get better.

I've used the MS-DOS version of Waffle for around a year, and I've been
running an AT&T 3B2 for four months. With some expert help, we've got
cnews and trn working just fine.

But we keep running out of disk. (Running out of inodes was solved at the
first repartitioning cycle!)
 
The MS-DOS Waffle expire gets rid of articles on a numbered, rather than
timestamped, basis.  You say you want to keep the last 500 articles in
rec.pyrotechnics, fine.  If you're not interested in comp.sys.obscure.4bit,
set it to expire at four or five articles.  My default is /keep=50 articles.
 
I find that to be an excellent system and haven't come across any problems
with it at all. Using cnews, if we get delays connecting to our host then we
can get fifteen or twenty megs of compressed news in one hit. That gets
uncompressed and sits there for a couple of days, unless I get desperate
and manually cruise through the news directories deleting unwanted files.
Why wouldn't a numerical expire work under Unix? Has it been done before
and rejected as unworkable?
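
Mechanically it looks simple enough; here's a rough sketch of a
keep-the-last-N pass over one group directory (the array cap is arbitrary,
and it doesn't touch the history file, which would still need its own
cleanup):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <dirent.h>
    #include <unistd.h>

    #define MAXART 20000            /* arbitrary cap for the sketch */

    static int
    numcmp(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;

        return (x > y) - (x < y);
    }

    /* Unlink all but the highest-numbered "keep" articles in groupdir. */
    void
    keep_last(const char *groupdir, int keep)
    {
        static long nums[MAXART];
        DIR *d = opendir(groupdir);
        struct dirent *e;
        char path[1024];
        int n = 0, i;

        if (d == NULL)
            return;
        while ((e = readdir(d)) != NULL && n < MAXART)
            if (strspn(e->d_name, "0123456789") == strlen(e->d_name))
                nums[n++] = atol(e->d_name);
        closedir(d);
        qsort(nums, n, sizeof(long), numcmp);
        for (i = 0; i < n - keep; i++) {
            sprintf(path, "%s/%ld", groupdir, nums[i]);
            unlink(path);
        }
    }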

PS: Great software, Henry. I hope they pay you what you're worth... 8-)
-- 
--------------------------------------------------------------------------
Charlie "The Bear" Lear | clear@cavebbs.gen.nz | Kawasaki Z750GT  DoD#0221
The Cave MegaBBS  +64 4 643429  V32 | PO Box 2009, Wellington, New Zealand
--------------------------------------------------------------------------

ske@pkmab.se (Kristoffer Eriksson) (12/15/90)

In article <1990Dec12.213956.6544@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>The value is up to you, but it would no doubt be a factor of their age,
>newsgroup, author/site (ie. keep local articles longer) modified by things
>like explicit expiry, etc.

I think the most natural system for fixed-space expiry is just to look at
space down at the level of individual newsgroups, assigning a fixed space
for each group. Either that, or just expire in FIFO-order for the whole
system, which will get you newsgroups that are sized relative to each other
according to their volume of postings.

-- 
Kristoffer Eriksson, Peridot Konsult AB, Hagagatan 6, S-703 40 Oerebro, Sweden
Phone: +46 19-13 03 60  !  e-mail: ske@pkmab.se
Fax:   +46 19-11 51 03  !  or ...!{uunet,mcsun}!sunic.sunet.se!kullmar!pkmab!ske