[news.software.b] history.dbm contents?

heiby@mcdchg.chg.mcd.mot.com (Ron Heiby) (05/14/91)

I've looked through the News 2.11 patch 19 source and believe I've
found every place where the history DBM file is referenced.  When an
article arrives on the system, the dbm file is queried to see if the
message-id already exists (case insensitive check).  If not, then the
article winds up getting installed on the system, a line is written to
the text history file, and an entry is made to the DBM history file
using the message-id as the key and the file offset into the text
history file of the line describing the article as the data.
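
The arrival path described above can be sketched roughly as follows.
This is a minimal illustration in Python, not the News 2.11 code: a
plain dict stands in for the DBM file, and the name `record_article`
and the tab-separated line layout are assumptions made for the example.

```python
import io

def record_article(history_text, history_index, message_id, line_rest):
    """Return True if the article is new and was recorded, False if duplicate.

    history_text  -- seekable binary file object: the text history file
    history_index -- dict standing in for the DBM file (key -> offset)
    """
    key = message_id.lower()           # case-insensitive duplicate check
    if key in history_index:
        return False                   # already seen: do not install again
    history_text.seek(0, io.SEEK_END)
    offset = history_text.tell()       # where the new history line starts
    history_text.write(f"{message_id}\t{line_rest}\n".encode("ascii"))
    history_index[key] = offset        # key: message-id, data: file offset
    return True
```

Only the `key in history_index` test matters for duplicate suppression;
the stored offset is what a lookup by message-id would need.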

While it might be really handy for some software that has a message-id
and wants to convert that to pathname(s) or such to have that info in
the DBM file that way, I can't find any software that actually makes
use of that information.  I've written a fairly complete expire Perl
script, which I'm about ready to start testing.  It occurs to me,
though, that if the actual *data* stored in the DBM file isn't used by
anyone, just the *key* (whether or not the key exists), then it
doesn't really matter whether that information is updated by expire.
If no one uses it, expire could simply delete keys for the ancient
articles and leave all the others untouched.  I would think that this
would make for a noticeable speed improvement.

Am I all wet, here?  Do I actually have to maintain that
correspondence between DBM and text forms of the history file?  If so,
why?  What software makes use of the data stored, rather than the fact
that some data exists?

BTW, has anyone done any speed comparisons between "good old" dbm and
GNU dbm?  It seems like on my system, the standard 2.11 expire.c runs
about three times slower if libgdbm.a (version 1.3) is linked in.
-- 
Ron Heiby, heiby@chg.mcd.mot.com	Moderator: comp.newprod
"Wrong is wrong, even when it helps you." Popeye

henry@zoo.toronto.edu (Henry Spencer) (05/14/91)

In article <62955@mcdchg.chg.mcd.mot.com> heiby@mcdchg.chg.mcd.mot.com (Ron Heiby) writes:
>...and an entry is made to the DBM history file
>using the message-id as the key and the file offset into the text
>history file of the line describing the article as the data.
>
>While it might be really handy for some software that has a message-id
>and wants to convert that to pathname(s) or such to have that info in
>the DBM file that way, I can't find any software that actually makes
>use of that information...

Some of the readers do.  They want to look up articles by message ID;
the only quick way to do that is to use the dbm/dbz index to locate the
article's history line.
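
A reader-side lookup of this kind might look something like the sketch
below.  It is an illustration only: the dict standing in for the
dbm/dbz index, the name `find_history_line`, and the sample history
lines (message-id, date, group/number separated by tabs) are all
assumptions, not taken from any particular newsreader.

```python
import io

def find_history_line(history_text, history_index, message_id):
    """Return the history line for message_id, or None if it is unknown."""
    offset = history_index.get(message_id.lower())
    if offset is None:
        return None
    history_text.seek(offset)          # jump straight to the line
    return history_text.readline().decode("ascii").rstrip("\n")

# Hypothetical two-line history file held in memory.
first = b"<1@a.b>\t05/14/91\tmisc.test/1\n"
second = b"<2@a.b>\t05/15/91\tmisc.test/2\n"
history = io.BytesIO(first + second)
index = {"<1@a.b>": 0, "<2@a.b>": len(first)}
line = find_history_line(history, index, "<2@a.b>")
```

Without the stored offset the only alternative is a linear scan of the
text history file.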

>BTW, has anyone done any speed comparisons between "good old" dbm and
>GNU dbm? ...

It's kind of pointless, since dbz blows the doors off both of them for
this application, and keeps much smaller files to boot.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

ian@airs.com (Ian Lance Taylor) (05/15/91)

heiby@mcdchg.chg.mcd.mot.com (Ron Heiby) writes:

>While it might be really handy for some software that has a message-id
>and wants to convert that to pathname(s) or such to have that info in
>the DBM file that way, I can't find any software that actually makes
>use of that information.

GNUS does, at least.  Since the subject has come up, I'd like to
mention that at least once a week I curse the fact that I can't look
up the articles that appear in the References: line (I can do this
using GNUS, but on my system it's too slow for me).  I hope more
newsreaders will add such a feature in the future.
-- 
Ian Taylor                   ian@airs.com                  uunet!airs!ian
First person to identify this quote wins a free e-mail message:
``Nobody believed him, so out of politeness to his listeners he pretended
  to be joking.''

jerry@olivey.ATC.Olivetti.Com (Jerry Aguirre) (05/16/91)

In article <62955@mcdchg.chg.mcd.mot.com> heiby@mcdchg.chg.mcd.mot.com (Ron Heiby) writes:
>While it might be really handy for some software that has a message-id
>and wants to convert that to pathname(s) or such to have that info in
>the DBM file that way, I can't find any software that actually makes
>use of that information.  I've written a fairly complete expire Perl

As mentioned, some news readers do make use of lookup by ID.  The idea
is that it is possible to read the article mentioned in the References
line.  Of course the trend currently is to include the entire article
being referenced, eliminating the need to find the "parent" article.
:-)

>script, which I'm about ready to start testing.  It occurs to me,
>though, that if the actual *data* stored in the DBM file isn't used by
>anyone, just the *key* (whether or not the key exists), then it
>doesn't really matter whether that information is updated by expire.
>If no one uses it, expire could simply delete keys for the ancient
>articles and leave all the others untouched.  I would think that this
>would make for a noticeable speed improvement.

Yes, you could store 0 bytes of data and fulfill what is required for
duplicate suppression.  That should result in a smaller history.pag
file.  But consider, you are storing perhaps 30 bytes of key and 4 bytes
of data.  Cutting back from 34 to 30 bytes is not going to make a
significant improvement.
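
As a quick check of that arithmetic, using the rough 30-byte and 4-byte
figures from the paragraph above and ignoring any per-entry overhead
the dbm library itself adds (which is not counted here):

```python
key_bytes = 30      # approximate message-id length quoted above
data_bytes = 4      # the stored file offset
saving = data_bytes / (key_bytes + data_bytes)
print(f"{saving:.1%}")   # prints 11.8%
```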

I have a "newalias" program for handling updates to my mail alias file
that does dbm adds and deletes instead of rebuilding the entire thing
from scratch.  It runs about 10 times faster that way.

But the history.pag file is a different case.  Even if we ignore the
period of inconsistency of the pointers into the text file that would
exist while it was being updated, there is still a more significant
problem.  The history.pag file depends on being "sparse" and, as it says
in the documentation, deleting an entry does not free the disk block.
If you went along deleting entries the distribution would eventually
result in every disk block in the history.pag file actually being
allocated.

In other words, the physical size of the history.pag file would grow to
equal its logical size.  Given how the logical size of the history.pag
file shocks people until they find out it is not really that big, I
think this would not be a good idea.  It might be OK for a few times but
at some point one would want to rebuild from scratch.
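
The logical-versus-physical distinction can be seen with a small
experiment.  The sketch below (the name `sparse_sizes` is made up for
the example) writes a single byte at the end of an otherwise empty file
and reports both sizes.  Whether the skipped region is really left
unallocated depends on the filesystem, and `st_blocks` is only
available on Unix-like systems, so treat the second number as
indicative.

```python
import os
import tempfile

def sparse_sizes(logical_size):
    """Create a file with one byte at the end; return (logical, allocated)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.seek(logical_size - 1)   # skip over a hole without writing it
            f.write(b"\0")             # only the final block holds data
        st = os.stat(path)
        # st_blocks counts 512-byte units; fall back to 0 where it is absent
        allocated = getattr(st, "st_blocks", 0) * 512
        return st.st_size, allocated
    finally:
        os.remove(path)

logical, allocated = sparse_sizes(1 << 20)
print(logical, allocated)
```

On a filesystem that supports holes the second number is typically far
smaller than the first, which is the situation with a freshly built
history.pag.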

I have been using the dbz package with my B news and it works great.
The history.pag file is lots smaller and the expire is about 10 times
faster.  The dbz package takes advantage of the fact that both the key
and the data are in the text file so it only needs to store the offset.
Given that the key is lots bigger than the offset, this is a bigger win
than not storing the offset.
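
The general idea of keeping only offsets in the index, and going back
to the text file for the key, can be illustrated as below.  This is
emphatically not the dbz algorithm or file format: the dict keyed by a
CRC32 hash, and the names `index_add` and `index_lookup`, are
assumptions chosen to keep the sketch short.

```python
import io
import zlib

def index_add(index, history_text, message_id, rest):
    """Append a history line; remember only its offset, filed under a hash."""
    history_text.seek(0, io.SEEK_END)
    offset = history_text.tell()
    history_text.write(f"{message_id}\t{rest}\n".encode("ascii"))
    slot = zlib.crc32(message_id.lower().encode("ascii"))
    index.setdefault(slot, []).append(offset)

def index_lookup(index, history_text, message_id):
    """Return the offset of message_id's history line, or None."""
    wanted = message_id.lower().encode("ascii")
    slot = zlib.crc32(wanted)
    for offset in index.get(slot, []):
        history_text.seek(offset)
        line = history_text.readline()
        # The key is not stored in the index, so confirm it in the text file.
        if line.split(b"\t", 1)[0].lower() == wanted:
            return offset
    return None
```

The message-id text itself never appears in the index; a hash
collision is resolved by reading the candidate line from the text file
and comparing.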

				Jerry Aguirre