[news.software.b] Why does rnews duplicate its history?

jc@minya.UUCP (John Chambers) (08/29/87)

Say, I was wondering if anyone out there could explain something that
has had me puzzled for a while.  Usenet has, as is well known, this "history"
file, and in addition, on this Sys/V machine, it has a history.d directory
that contains files 0 thru 9.  Each of these is a subset of "history",
and this sorta makes sense; as the install doc describes it, this is to
speed up history lookups.

However, the software as compiled here insists on keeping *both* the
large history file and the 10 little ones.  This strikes me as a bit
on the bizarre side.  I mean, isn't that getting the worst of both
worlds?  Poor rnews is looking everything up in both of them, and 
updating both.  It seems that just using the one large file would
save a whole lot of time.  The only other program that regularly
grovels in these files is expire, and it does both of them, too.

When I first installed usenet, I figured that I'd keep it simple and
just use one big history file.  One of my first puzzles was why each
line was there twice.  I examined the code, and discovered that it
quite intentionally does this if the history.d files are missing.  So
I created history.d/0, history.d/1, and so on, and it now writes the
two lines to two different files.

OK, I thought, perhaps it is like the log file; perhaps you don't
have to have the big history file at all.  So I deleted it.  (Well,
actually, in my paranoia, I merely renamed it; good thing, too...)
The next morning I found I had a mail message saying that there was
a problem with the history file, and it had been fixed.  I investigated,
and sure enough, "history" was back, with all the new articles in it.
On further investigation, I found something rather startling.  All
the files in history.d had been truncated, discarding all the history,
and they also contained only the new entries.

Now, this really strikes me as inexplicable.  According to the install
doc, the history.d files exist to speed things up.  Instead, they seem
to be a totally redundant history: the software insists that it *will*
write two copies of everything, and if I try to remove things, it reaches
out and garbages the rest.  Considering that the entries aren't sorted,
so two linear searches are being done thru some very large files, this
hardly seems like a nice way to do things.

Am I totally mistaken, or is this really supposed to be as wasteful of
the machine's cycles and disk space as it appears?  Or did I perhaps do
something stupid, and some obscure tweak will make it all work sanely?
If so, I guess I'm not smart enough to figure it out from the doc, so
someone out there maybe oughta explain it before I go in and wreak some
bodily harm on the poor unsuspecting code.  (It'd be easy enough to get
rid of all the history.d code.)

Or has this perhaps been corrected somewhere and I just didn't notice?
(I admit that, because patch does some very bizarre things with some 
of the patches, producing code that is totally uncompilable, I have 
installed only a couple of them so far.)

Bye for now.
-- 
	John Chambers <{adelie,ima,maynard}!minya!{jc,root}> (617/484-6393)

henry@utzoo.UUCP (Henry Spencer) (08/31/87)

> Say, I was wondering if anyone out there could explain something that
> has had me puzzled for a while.  Usenet has, as is well known, this "history"
> file, and in addition, on this Sys/V machine, it has a history.d directory
> that contains files 0 thru 9.  Each of these is a subset of "history"...
> However, the software as compiled here insists on keeping *both* the
> large history file and the 10 little ones.  This strikes me as a bit
> on the bizarre side.  I mean, isn't that getting the worst of both
> worlds? ...

What you're seeing is a SysV emulation of something that makes a bit more
sense in non-SysV Unixes.  In the non-SysV world, there's a little library
called "dbm" which lets one maintain sort of a symbol table.  The point of
this is that news can use dbm to find out *quickly* whether an article with
a given message-id exists on the system already.  Using dbm is *much much*
faster than scanning the whole history file, especially in non-SysV systems
that historically have rather slow implementations of fgets.  The history
file provides the authoritative record; dbm's role is to provide fast access.
It's worth the price of having to keep both up to date. 
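
To make the division of labour concrete, here is a minimal sketch in Python
(not the news code itself, which is C).  It uses Python's standard dbm.dumb
module as a stand-in for the dbm library; the function names, the file
layout, and the sample history line are all invented for illustration.

```python
import dbm.dumb
import os
import tempfile


def add_article(history_path, index, message_id, line):
    """Append to the flat history file, then note the id in the keyed index."""
    with open(history_path, "a") as f:
        f.write(line + "\n")
    index[message_id.encode()] = b"1"  # the value is unused; only the key matters


def seen_by_scan(history_path, message_id):
    """Slow path: read the history file line by line until the id turns up."""
    with open(history_path) as f:
        return any(l.split("\t", 1)[0] == message_id for l in f)


def seen_by_index(index, message_id):
    """Fast path: a single keyed lookup, independent of the history file's size."""
    return message_id.encode() in index


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    history = os.path.join(workdir, "history")
    index = dbm.dumb.open(os.path.join(workdir, "history_index"), "c")
    # Made-up sample entry: message-id, date, newsgroup/article number.
    add_article(history, index, "<123@minya.UUCP>",
                "<123@minya.UUCP>\t08/29/87\tnews.software.b/42")
    print(seen_by_index(index, "<123@minya.UUCP>"))
    print(seen_by_scan(history, "<999@minya.UUCP>"))
    index.close()
```

Both writes happen on every new article, which is the price mentioned above;
the payoff is that the duplicate check no longer has to touch the big file.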

Unfortunately, there's no dbm in SysV, so it was faked.  I don't know whether
anybody looked carefully at the tradeoffs in SysV, although the 0-9 split
should be cutting search overhead noticeably.  The point is that it's not
really a duplication of effort -- the main history file and its subsidiaries
are specialized for different uses.
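
As a rough illustration of why the 0-9 split helps, here is a minimal Python
sketch of the general technique.  How the real code picks a file for a given
message-id isn't stated here, so the byte-sum rule below is purely an
assumption, as are the function names and the sample line.

```python
import os
import tempfile

NBUCKETS = 10  # one file each for history.d/0 through history.d/9


def bucket_for(message_id):
    """Hypothetical rule: sum of the id's bytes, modulo 10.

    Any rule works as long as it gives the same answer on every run, so
    that a lookup goes to the same file the entry was written to.
    """
    return sum(message_id.encode()) % NBUCKETS


def bucket_path(histdir, message_id):
    return os.path.join(histdir, "history.d", str(bucket_for(message_id)))


def record(histdir, message_id, line):
    """Append the entry to the one small file chosen for this message-id."""
    path = bucket_path(histdir, message_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(line + "\n")


def seen(histdir, message_id):
    """Linear search, but through only one of the ten files."""
    path = bucket_path(histdir, message_id)
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return any(l.split("\t", 1)[0] == message_id for l in f)


if __name__ == "__main__":
    histdir = tempfile.mkdtemp()
    record(histdir, "<123@minya.UUCP>",
           "<123@minya.UUCP>\t08/29/87\tnews.software.b/42")
    print(seen(histdir, "<123@minya.UUCP>"))
```

If the ids spread evenly, each lookup scans roughly a tenth of the entries.
That is still a linear search, just a shorter one, which is why it is only
an approximation of what dbm provides.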
-- 
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry

rick@seismo.CSS.GOV (Rick Adams) (09/01/87)

Yes, it is a bit bizarre to keep two copies of the history file. This
was fixed in patch #4 (we're up to #8, with #9 coming soon).

--rick