[news.software.b] What If...I remove "/usr/lib/news/history*" ?

taylor@limbo.Intuitive.Com (Dave Taylor) (03/02/90)

I have this continuing problem with netnews in that it takes up too
much disk space (what's new ;-) and expires take incredibly long to
run.  So I was a'thinkin'...

What if I were to remove the following files from /usr/lib/news ?

	history
	history.dir
	history.pag

As far as I am aware -- and keep in mind that the only news reader
we have installed on this site is "rn" -- the only purpose these
files serve is to ensure that duplicate articles aren't allowed.
Am I right?  (I have heard some reference to commands in "rn" that
allow you to use the file for certain purposes, but we rarely
seem to use them, and could probably live without them without too
much difficulty: based on the '...' checking, it appears that "rn"
doesn't, for example, use the file to handle ^P 'show previous
article in this discussion'.)

If we remove this, I assume that what I'd need to do would be to
write a new "unpack news batches" program, right?  That'd be okay;
I'm willing to do that...in fact, as far as I can tell, it isn't
too much work either: you get a batch file whose name is handed to
you, run it through uncompress, then read through a big 'shar'-like
file containing a stream of articles, each headed with an indication
of how many lines are contained therein.  To unpack, simply put
each article into its own temp file, check its Message-ID against
those already on the machine, and then, if it's unique, add it to
the files on the machine, updating the active file and the group's
local sequence number.
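
A rough sketch of that unbatching loop in shell.  The "#! rnews <size>"
marker is the standard batch framing (its count is really bytes, which
this toy splitter ignores); the sample batch, the flat "seen-ids" file
standing in for the history dbm, and all the file names are made up for
illustration:

```shell
#!/bin/sh
# Toy unbatcher: split a batch on its "#! rnews" markers, then file
# each article only if its Message-ID hasn't been seen before.
# A flat "seen-ids" file stands in for the history dbm; the batch,
# paths, and byte counts are all illustrative.
set -e
work=${TMPDIR:-/tmp}/unbatch.$$
mkdir "$work"; cd "$work"

seen=seen-ids
: > "$seen"

cat > batch <<'EOF'
#! rnews 57
Message-ID: <1@example>
Newsgroups: misc.test

body one
#! rnews 57
Message-ID: <1@example>
Newsgroups: misc.test

duplicate body
EOF

# Split on the marker lines (real rnews honors the byte counts instead).
awk '/^#! rnews /{n++; next} {print > ("art." n)}' batch

kept=0
for art in art.*; do
    id=`sed -n 's/^[Mm]essage-[Ii][Dd]: *//p' "$art" | sed 1q`
    if grep -F -x "$id" "$seen" >/dev/null; then
        rm -f "$art"                # duplicate: drop it
    else
        echo "$id" >> "$seen"       # remember it; a real unpacker would
        kept=`expr $kept + 1`       # now file it and update the active file
    fi
done
echo "kept $kept of 2 articles"
```

A dbm lookup does in one probe what the grep here does with a linear
scan of the whole list, which starts to matter at full feed volume.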

Really what I'd like to do is to write an unpacker that will
immediately throw away articles from groups that appear/don't appear
in a file.  The goal would be to have the file generated via a modified
pexpire(1L) program to reflect JUST the groups that people are actually
actively reading on the machine.  ALL other articles would vanish
without a trace, never to take up disk space at all!  Creating a nice
clean piece of code that is easy to understand, maintain, and modify
would be a good side benefit, as would the incredibly faster expire
that could be written too (like "find . -mtime +4 -exec /bin/rm {} \;"!)
(though even the need for a faster expire would be greatly reduced once
you were guaranteed that ONLY the articles you're interested in are
actually sitting on your disk).
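
The throw-it-away-on-arrival idea can be sketched the same way; the
"wanted" list, the sample article, and all the names here are invented
for illustration:

```shell
#!/bin/sh
# Toy group filter: keep an article only if one of its Newsgroups
# appears in a "wanted" list.  All names and contents are made up;
# a real unpacker would run this check inside its unbatching loop.
set -e
work=${TMPDIR:-/tmp}/filter.$$
mkdir "$work"; cd "$work"

cat > wanted <<'EOF'
comp.lang.c
news.software.b
EOF

cat > article <<'EOF'
Message-ID: <42@example>
Newsgroups: rec.humor,talk.bizarre

a joke
EOF

# Pull the Newsgroups header apart and check each group.
groups=`sed -n 's/^Newsgroups: *//p' article | sed 1q | tr ',' ' '`
keep=no
for g in $groups; do
    if grep -x "$g" wanted >/dev/null; then keep=yes; fi
done
if [ "$keep" = no ]; then
    rm -f article               # vanish without a trace
fi
```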

This all really hinges around the history file, though.  Clearly, when
my expires take many many hours to run, it's because they're munging
through the slow and painful process of continually updating the DBM
history database ... (right?) ... I mean, I can run "fixactive(1L)"
and have it check *every* article in my /usr/spool/news directory in
under 2 minutes total!

	I welcome thoughts on this, either here on the net or via
	email...and if you're interested in a similar piece of software,
	please feel free to drop me a note with your requirements too.

						-- Dave Taylor
Intuitive Systems
Mountain View, California

taylor@limbo.intuitive.com    or   {uunet!}{decwrl,apple}!limbo!taylor

henry@utzoo.uucp (Henry Spencer) (03/02/90)

In article <490@limbo.Intuitive.Com> taylor@limbo.Intuitive.Com (Dave Taylor) writes:
>What if I were to remove the following files from /usr/lib/news ?
>	history
>	history.dir
>	history.pag
>As far as I am aware -- and keep in mind that the only news reader
>we have installed on this site is "rn" -- the only purpose these
>files serve is to ensure that duplicate articles aren't allowed.
>Am I right? ...

Nope, sorry.  They have one or two other notable roles.  In particular,
expire relies completely on "history", and cannot function without it.
Removing the .dir and .pag files will break duplicate checking and some
of the fancier functions in sophisticated news readers, but otherwise
shouldn't do anything awful that I can think of.  I do suggest learning
more about the functioning of the news software before trying such drastic
tampering with its databases, though.

>If we remove this, I assume that what I'd need to do would be to
>write a new "unpack news batches" program, right?  That'd be okay;
>I'm willing to do that...in fact, as far as I can tell, it isn't
>too much work either...

Ha ha.  That's what I told Geoff five or six years ago.  I'm not sure
he's forgiven me for it.  What you are talking about doing is completely
reinventing rnews/relaynews (depending on which news you're running), and
that is *not* a small job.  It's 4000+ lines of code in C News, and
with news at its current volume, you'd better pay a whole lot of attention
to performance when you do it, because otherwise it's easy to spend all
day processing news and still not be able to keep up.

>simply put the article into its own temp file, check its Message-ID
>against those already on the machine, then if unique...

How are you going to check the MessageID against the others on the
machine if you've deleted the database that keeps track of such things?
That's what the history.* files are for!

>Really what I'd like to do is to write an unpacker that will 
>immediately throw away articles from groups that appear/don't appear
>in a file.  The goal would be to have the file generated via a modified
>pexpire(1L) program to reflect JUST the groups that people are actually
>actively reading on the machine.  ALL other articles would vanish 
>without a trace, never to take up disk space at all! 

You can do this with C News, by changing the fourth field of active-file
lines to "x".  All you need to do is the data gathering to figure out
which newsgroups to do it to (which isn't as easy as it looks, e.g. if
you've got users who don't always read all the newsgroups they subscribe
to) and a little bit of code to modify the active file accordingly.
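
That active-file change is mechanical once you have the list; here is a
sketch against an invented active file and wanted list (a real run would
lock the news system first and edit a copy of the real active file):

```shell
#!/bin/sh
# Sketch: flip the fourth ("flags") field of every active-file line to
# "x" when the group isn't on a "wanted" list, so C News drops those
# articles on arrival.  Both files here are toy stand-ins.
set -e
work=${TMPDIR:-/tmp}/activex.$$
mkdir "$work"; cd "$work"

cat > active <<'EOF'
comp.lang.c 00123 00001 y
rec.humor 04567 00001 y
news.software.b 00089 00001 y
EOF

cat > wanted <<'EOF'
comp.lang.c
news.software.b
EOF

# First pass reads the wanted list, second pass rewrites the flags.
awk 'NR==FNR { want[$1] = 1; next }
     !($1 in want) { $4 = "x" }
     { print }' wanted active > active.new
mv active.new active
```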

>Creating a nice
>clean piece of code that is easy to understand, maintain, and modify
>would be a good side-benefit

We fancy we've done this with C News.  It was/is a whole lot more work
than we expected.  You might want to look at what we've done before you
strike out to write your own.

> as would the incredibly faster expire
>that could be written too (like "find . -mtime +4 -exec /bin/rm {} \;"!)

Sorry, find is not a particularly fast way to expire things.  C News
expire is faster, last time I compared timings, and it gives you much
more control.

>This all really hinges around the history file, though.  Clearly, when
>my expires take many many hours to run, it's because they're munging
>through the slow and painful process of continually updating the DBM
>history database ... (right?) ... I mean, I can run "fixactive(1L)"
>and have it check *every* article in my /usr/spool/news directory in
>under 2 minutes total!

If your expires take many hours to run, you're running the old B News
expire, which was and is an incredible hog.  C News expire is vastly
faster, even with dbm updates.  Also, I think you've misunderstood
fixactive -- if I've remembered correctly what it does, it only looks
at each *directory* of /usr/spool/news, *not* at each article.  The
difference is, uh, important.

>	I welcome thoughts on this, either here on the net or via
>	email...and if you're interested in a similar piece of software,
>	please feel free to drop me a note with your requirements too.

Dave, I'm sorry, but from this article I get a very strong impression
that you simply don't know the news system well enough to understand
what you'd be undertaking.  It's not that easy.  I speak from experience.
-- 
MSDOS, abbrev:  Maybe SomeDay |     Henry Spencer at U of Toronto Zoology
an Operating System.          | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

brad@looking.on.ca (Brad Templeton) (03/03/90)

Actually, throwing away history is exactly what I did a few years
ago when I was a pure simple leaf site on a small machine.

History was gone, but as a leaf, my feed would never send me duplicates,
so that was not a problem.   Cancels didn't work, but I could live with
that.

Expire was just 'find' piped into a removing program -- I could only keep
a few days' news, and like most leaves, under such circumstances fancy
explicit expire dates were not a problem.
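
That kind of leaf expire really is just a one-liner; here's the shape of
it against a toy spool (the path and the 4-day cutoff are invented, and
the backdated file simulates an old article):

```shell
#!/bin/sh
# Sketch of a history-free leaf expire: remove every article file under
# the spool older than 4 days.  The spool here is a toy stand-in for
# /usr/spool/news, with one backdated article to give find something to do.
set -e
spool=${TMPDIR:-/tmp}/spool.$$
mkdir -p "$spool/misc/test"
echo old > "$spool/misc/test/1"
echo new > "$spool/misc/test/2"
touch -t 199001011200 "$spool/misc/test/1"   # pretend article 1 is ancient

# The whole expire: no history file anywhere in sight.
find "$spool" -type f -name '[0-9]*' -mtime +4 -print | xargs rm -f
```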

Xrefs still did the cross-referencing for rn.

It saved a lot of disk space and made B News faster too.  Not to mention
expire.  (Oddly enough, I only wrote my space-based expire after I went
back to being a non-leaf.)

In many ways #ifdef LEAF is not a bad idea for a news program.  There is
lots of stuff you don't have to do.  That is shrinking with time, however.
Cancel and supersedes become more important, and they need history.  But a
leaf may not need broadcast code etc.

This is important because leaves are usually smaller machines, even 286s.

My first was an ONYX C8002!  256K RAM, 10 meg disk (including Unix, and I
still had room to run news and do other stuff, so there!)
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (03/03/90)

I agree with Henry.  Get C News and compile expire with dbz + INCORE.  It
runs quite quickly.


-- 
Jon Zeeff    	zeeff@b-tech.ann-arbor.mi.us  or b-tech!zeeff

icsu6000@caesar (Jaye Mathisen) (03/04/90)

In article <490@limbo.Intuitive.Com> taylor@limbo.Intuitive.Com (Dave Taylor) writes:
>What if I were to remove the following files from /usr/lib/news ?
>	history
>	history.dir
>	history.pag
>As far as I am aware -- and keep in mind that the only news reader
>we have installed on this site is "rn" -- the only purpose these
>files serve is to ensure that duplicate articles aren't allowed.
>Am I right?  (I have heard some reference to commands in "rn" that

This would result in near-disaster.  No duplicate checking, and
expire would have a tough time running.

>indication of how many lines are contained therein.  To unpack,
>simply put the article into its own temp file, check its Message-ID
>against those already on the machine, then if unique, add it to the

Where are you going to store Message-IDs?  That's one of the
functions of history.*

>Really what I'd like to do is to write an unpacker that will
>immediately throw away articles from groups that appear/don't appear
>in a file.  The goal would be to have the file generated via a modified
>pexpire(1L) program to reflect JUST the groups that people are actually
>actively reading on the machine.  ALL other articles would vanish

How do you handle the case of looking through comp.archives and finding
that there's this interesting thingie in group x, but since nobody
subscribes to group x, the spooled article isn't there to find any more
info...

>This all really hinges around the history file, though.  Clearly, when
>my expires take many many hours to run, it's because they're munging

Well, I don't know what hardware you're using (I think HP from a previous
posting :-)), but I keep 2 weeks of Usenet, plus every alternative hierarchy
I can lay my hands on, using an HP 350, and expire cranks through nightly
in about 2.5 hours...

>through the slow and painful process of continually updating the DBM
>history database ... (right?) ... I mean, I can run "fixactive(1L)"

In a nutshell, expire creates a new history file each night, after making
one run through the history.* files and unlinking expired articles.  It
names it nhistory.{dir,pag} while creating it, and then renames it to
history.{dir,pag}, etc.

There's a sort in there somewhere also.
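
That one-pass rewrite-and-rename is easy to picture with a toy flat-file
history (the real one also carries dates and the dbm index, so this is
only the shape of it):

```shell
#!/bin/sh
# Sketch of expire's one pass: copy still-fresh history lines to a new
# file, unlink the articles behind the expired lines, then rename the
# new file into place.  The three-field history format here is a toy
# simplification of the real thing.
set -e
work=${TMPDIR:-/tmp}/expire.$$
mkdir "$work"; cd "$work"

# Toy history: message-id, age in days, article file.
cat > history <<'EOF'
<1@a> 10 misc/1
<2@a> 2 misc/2
EOF
mkdir misc
echo x > misc/1
echo y > misc/2

: > nhistory
while read id age file; do
    if [ "$age" -gt 4 ]; then
        rm -f "$file"               # expired: unlink the article
    else
        echo "$id $age $file" >> nhistory
    fi
done < history
mv nhistory history                 # the new history goes live in one rename
```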



--
+-----------------------------------------------------------------------------+
| Jaye Mathisen, systems manager      Internet: icsu6000@caesar.cs.montana.edu|
| 410 Roberts Hall                      BITNET: icsu6000@mtsunix1.bitnet      |
| Dept. of Computer Science                                                   |
+-----------------------------------------------------------------------------+