[comp.sources.d] A few questions/comments on Rkive

jos@idca.tds.PHILIPS.nl (Jos Vos) (07/05/89)

After reading quickly through the accompanying documentation of rkive,
I have the following remarks (please don't flame if an answer can be
found in the documentation - I read it *quickly*):

-  First of all, it looks GREAT. I hope to start using it soon!
   Besides sources I want (have) to use it for a number of "normal"
   newsgroups too.

-  It is mentioned in rkive(1) that an existing file is (by default)
   not overwritten. What happens then?

   This area of problems is also indicated in the IDEAS file: a
   more flexible naming scheme should be possible besides the
   article number (and the other two). E.g. a format
   string using % notations for time parameters (day, hour, seconds).
   Also enabling the use of a user program that generates the filename
   (without the directory) would be a possibility: this is the most
   generic way and quite easy to implement (but not efficient...).

-  How is it known whether an article is already archived?

   The previous problem becomes BIG if it can only be concluded
   that an article is already archived because the file exists...

-  How are crosspostings handled?

-  Is it not possible to use rkive as a program directly
   from the sys file (that is, with the article as stdin)?
   Probably not (the first problem SHOULD be solved then).
   I think this is a much cleaner way of archiving the news, isn't it?
   (who knows what happens with /usr/spool/news before tonight :-))

I know I could find the answers to some questions in the code, but
I didn't have time to look at that now. And besides that, many
people will sooner or later have the same questions.

-- 
-- ######   Jos Vos   ######   Internet   jos@idca.tds.philips.nl   ######
-- ######             ######   UUCP         ...!mcvax!philapd!jos   ######

kent@ssbell.UUCP (Kent Landfield) (07/06/89)

In article <1123@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes:
>After reading quickly through the accompanying documentation of rkive,
>I have the following remarks (please don't flame if an answer can be
>found in the documentation - I read it *quickly*):
>
>-  First of all, it looks GREAT. I hope to start using it soon!
>   Besides sources I want (have) to use it for a number of "normal"
>   newsgroups too.
>
>-  It is mentioned in rkive(1) that an existing file is (by default)
>   not overwritten. What happens then?

rkive handles it differently depending on whether the article is a 
REPOST or not.  If rkive detects that the destination (or target file) 
name exists and the article is a ....

NON-REPOST Article:
	In the event that any duplicate is encountered, rkive creates a
	problems directory (if necessary) as specified in the PROBLEMS
	line of the rkive.cf configuration file. It then stores the 
	inbound article in the problems directory within a subdirectory
	that reflects the name of the newsgroup the duplicate was found
	in. The archive administrator(s) specified in the rkive.cf are 
	mailed a message indicating what has occurred. The original in 
	the archive is not overwritten.  The duplicate then becomes a 
	matter of manual cleanup.

REPOST Article:
	Depending on how the software is compiled, REPOSTS are currently
	handled in one of three ways. In all three methods the archive
	administrator is notified of the occurrence via e-mail.
	
	 MV_ORIGINAL
	     The original article is placed (moved) into a subdirectory in
	     the problems directory named "Originals". The inbound reposted 
	     article is then placed into the archive in the correct position.
	     (My favorite..:-))
	
	 ADD_REPOST_SUFFIX 
	     If ADD_REPOST_SUFFIX is defined, all reposts will have the 
	     string specified in the REPOST_SUFFIX define appended to the 
	     archive filename so that a repost of elm/part07 would appear 
	     in the archive as elm/part07-repost prior to any compression.
	     (Careful with this one folks..)
	
	 No Reposting Defines specified:
	    The inbound article would be placed into the archive in the 
	    correct position only if the initial article is not in the archive.
	    Otherwise the reposted article is placed in the problems directory 
	    as a normal duplicate article is now.
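
The duplicate and repost rules above can be summarized in a minimal Python sketch. This is not rkive's code (rkive is a compiled program and the policy is chosen at compile time); the function name, the policy strings and the exact layout of the problems directory are assumptions made for illustration only.

```python
import posixpath


def resolve_conflict(target, is_repost, policy, problems_dir, newsgroup):
    """Decide where files go when `target` already exists in the archive.

    policy is one of "mv_original", "add_repost_suffix" or "none"
    (hypothetical names standing in for the compile-time defines).
    Returns a dict giving the final location of the inbound article
    and of the original that was already archived.
    """
    name = posixpath.basename(target)
    # Assumed layout: <problems>/<newsgroup>/<name> for plain duplicates.
    duplicate_slot = posixpath.join(problems_dir, newsgroup, name)

    if not is_repost or policy == "none":
        # Ordinary duplicate: the archive copy is never overwritten.
        return {"inbound": duplicate_slot, "original": target}
    if policy == "mv_original":
        # Original is moved aside, the repost takes its place.
        moved = posixpath.join(problems_dir, "Originals", name)
        return {"inbound": target, "original": moved}
    if policy == "add_repost_suffix":
        # Both are kept; the repost gets a suffix (assumed "-repost").
        return {"inbound": target + "-repost", "original": target}
    raise ValueError("unknown repost policy: %r" % policy)


if __name__ == "__main__":
    print(resolve_conflict("/archive/elm/part07", True, "add_repost_suffix",
                           "/archive/problems", "comp.sources.unix"))
```

In every branch the administrator would still be mailed; that step is left out of the sketch.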
	
>   This area of problems is also indicated in the IDEAS file: a
>   more flexible naming scheme should be possible besides the
>   article number (and the other two). E.g. a format
>   string using % notations for time parameters (day, hour, seconds).
>   Also enabling the use of a user program that generates the filename
>   (without the directory) would be a possibility: this is the most
>   generic way and quite easy to implement (but not efficient...).

The IDEAS file describes the need for an alternate way to archive newsgroups
that do not support the auxiliary headers. This is necessary since the
Article-Number method uses the "news subsystem" naming scheme. If a news
system numbering was restarted from scratch or the entire archive was moved
to a different machine, problems could occur due to the potential for duplicate
filenames.  This is *not* something that you do everyday but it is a problem 
that *can* be avoided.

A patch is in testing right now to be released next week that has an
additional method of archiving. Chronological archiving support has been
added which allows articles to be archived in a format of...

	volumeYY/MOY/YYMMDD.II or volumeYY/YYMMDD.II where 
		YY  - two digit year,
		MOY - Jun, Jul etc (table configurable),
		MM  - two digit month
		DD  - two digit day
		II  - daily issue number which represents the number
		      of the article in the order of processing.
example:
	volume89/Jul/890706.01 or volume89/890706.01
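
For readers who want to see the naming scheme spelled out, here is a small Python sketch that builds both forms of the chronological name. The function name and arguments are hypothetical; only the format itself comes from the description above.

```python
from datetime import date

# Month directory names; the post says this table is configurable.
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def chrono_path(day, issue, use_month_dir=True, month_names=MONTH_NAMES):
    """Build volumeYY/MOY/YYMMDD.II, or volumeYY/YYMMDD.II without MOY."""
    yy = day.year % 100
    parts = ["volume%02d" % yy]
    if use_month_dir:
        parts.append(month_names[day.month - 1])
    # II is the daily issue number, in order of processing.
    parts.append("%02d%02d%02d.%02d" % (yy, day.month, day.day, issue))
    return "/".join(parts)


if __name__ == "__main__":
    print(chrono_path(date(1989, 7, 6), 1))
    print(chrono_path(date(1989, 7, 6), 1, use_month_dir=False))
```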

I agree a generic hook is needed for the actual storage vehicle so
as to support new methods over distributed media. That is in the works
although *any* and *all* ideas are welcome and encouraged...

>-  How it is known whether an article is already archived?
>
>   The previous problem becomes BIG if it can only be concluded
>   that an article is already archived because the file exists...

The test as to whether an article is already archived is done by checking 
if the archive file exists. I'm not sure what you mean by BIG. I have been 
running rkive since February and I have not moved my archive to another 
machine or restarted my News numbering once. :-) (Wait till I put up Cnews 
though :-))
Please remember, this archiver was initially designed as a sources archiver. 
I have added the Chronological method which solves the problems of restarting 
the news system and moving the archive that could have been a problem with 
Article-Number archiving.  You can now archive non-sources groups just as 
effectively as sources groups. Well, as soon as the patch is posted next 
week.. :-) 

>-  How are crosspostings handled?

Currently, crosspostings are *not* handled. rkive archives the newsgroups
that you specify in the rkive.cf configuration file.  It blindly ignores
crosspostings and worries only about the target newsgroup.  What does this
mean ? If you have specified that you wish to archive comp.sources.unix
and comp.sources.d and the monthly informational posting goes out, you
will currently get *two* copies..... This is a recognized deficiency. It needs
to check to see if any of the crossposted groups are being archived as well
and attempt to link the files. I say attempt since my archives here at ssbell
reside on 4 different filesystems and as soon as I finish the distributed
version, they will be scattered on as many machines. :-)
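
A rough Python sketch of the missing crosspost check might look like the following. It is only an illustration of the idea described above, not anything present in rkive: the function names are invented, and the fallback to a copy when a hard link fails (for example across filesystems) is an assumption about how "attempt to link" could be handled.

```python
import os
import shutil


def archived_targets(newsgroups_header, archived_groups):
    """Return the crossposted groups that are also being archived.

    newsgroups_header is the value of the article's Newsgroups: line,
    e.g. "comp.sources.unix,comp.sources.d".
    """
    groups = [g.strip() for g in newsgroups_header.split(",") if g.strip()]
    return [g for g in groups if g in archived_groups]


def link_or_copy(src, dst):
    """Try a hard link first; fall back to a copy if linking fails."""
    try:
        os.link(src, dst)
        return "linked"
    except OSError:
        # Hard links cannot span filesystems, so keep a second copy.
        shutil.copy2(src, dst)
        return "copied"


if __name__ == "__main__":
    archived = {"comp.sources.unix", "comp.sources.d"}
    print(archived_targets("comp.sources.unix,comp.sources.d", archived))
```

The first group in the returned list would receive the real file and the remaining groups would be handed to link_or_copy.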

>-  Is it not possible to use rkive as a program directly
>   from the sys file (that is, with the article as stdin)?
>   Probably not (the first problem SHOULD be solved then).

No. rkive is meant to run from cron and not receive the articles from stdin.
To be quite honest, I never really thought about doing it that way but
if I ... :-) :-) Currently, that is not in the works.

>   I think this is a much cleaner way of archiving the news, isn't it?
>   (who knows what happens with /usr/spool/news before tonight :-))

On my machines, I know... :-)

>I know I could find the answer of some questions in the code, but
>I didn't have time to look at that now. And besides that, much
>people (?) will sooner or later have the same questions.

Please, ask away! I *expected* that I would be answering questions. Better 
sooner than later. I have been receiving some *GREAT* ideas from the net 
as to ways to improve and enhance rkive's functionality. Thanks!  I answer
my mail so if you have not gotten an answer back, I probably didn't get
it. I am planning on posting the patch to comp.sources.bugs and sending
a copy as well to rich. 

Distributed archiving is next on my list. Also the "random software downloader"
for retrieving complete packages, patches and all, is in development.  Anyone 
want to help me name the "random software downloader" ? get is already taken 
and rsd sounds so bland.. :-) :-)

>-- ######   Jos Vos   ######   Internet   jos@idca.tds.philips.nl   ######

			Thanks Jos!
				-Kent+
---
Kent Landfield               UUCP:     kent@ssbell
Sterling Software FSG/IMD    INTERNET: kent@ssbell.uu.net
1404 Ft. Crook Rd. South     Phone:    (402) 291-8300 
Bellevue, NE. 68005-2969     FAX:      (402) 291-4362

jos@idca.tds.PHILIPS.nl (Jos Vos) (07/06/89)

In article <520@ssbell.UUCP> kent@ssbell.UUCP (Kent Landfield) writes:
>In article <1123@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes:

>A patch is in testing right now to be released next week that has an
>additional method of archiving. ....
> ....
>example:
>	volume89/Jul/890706.01 or volume89/890706.01

I already looked in the code and saw that it would be quite easy
to add an archive method that popen's a user program (specified in some
way) that puts a plain filename on stdout.
Then I can play with SysV date +%... as much as I like.
The next step could be to let that program generate a relative
pathname; in that case the program could be just a script with
	date '+volume%y/%h/%y%m%d'
Only the problem with the .seq suffix should be solved then,
if you *want* to solve it.
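
The hook idea can be sketched in Python as follows. This is an assumption-laden illustration rather than a patch for rkive: the function name is made up, the helper program is whatever the administrator configures, and the sketch does not address the .seq suffix question.

```python
import subprocess


def filename_from_program(argv):
    """Run a user-supplied program and use its first output line as the
    relative archive name (the popen-style hook described above)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    name = lines[0].strip() if lines else ""
    # Refuse empty names and anything that could escape the archive root.
    if not name or name.startswith("/") or ".." in name.split("/"):
        raise ValueError("unusable name from helper: %r" % name)
    return name


if __name__ == "__main__":
    # The helper here is just date(1) with the format from the post.
    print(filename_from_program(["date", "+volume%y/%h/%y%m%d"]))
```

Starting one process per article is the "not efficient" part; the flexibility comes from the helper being any program at all.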

>The test as to whether an article is already archived is done by checking 
>if the archive file exists. I'm not sure what you mean by BIG. ...

I only meant that the previous item (generating your own filenames) would
mean you can no longer detect the relation between an article and the
filename it was saved under. Doesn't that problem also occur in your
proposed scheme?
What about the .archived file mentioned somewhere in the documentation?

>>-  How are crosspostings handled?
>Currently, crosspostings are *not* handled. ...

It becomes a problem if you want to use rkive for archiving a lot of
newsgroups, i.e. not only sources. But I can imagine it's quite
difficult to handle that correctly (w.r.t. the rkive.cf file's lists).

>>-  Is it not possible to use rkive as a program directly
>>   from the sys file (that is, with the article as stdin)?
>No. rkive is meant to run from cron and not receive the articles from stdin.

Still a nice item for the IDEAS file :-)

-- 
-- ######   Jos Vos   ######   Internet   jos@idca.tds.philips.nl   ######
-- ######             ######   UUCP         ...!mcvax!philapd!jos   ######

clewis@eci386.UUCP (07/07/89)

I've just started up rkive and like it.  Thanks.
But I do have a few preliminary comments - one already touched on:

In article <520@ssbell.UUCP> kent@ssbell.UUCP (Kent Landfield) writes:
>In article <1123@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes:

>>-  Is it not possible to use rkive as a program directly
>>   from the sys file (that is, with the article as stdin)?
>>   Probably not (the first problem SHOULD be solved then).

>No. rkive is meant to run from cron and not receive the articles from stdin.
>To be quite honest, I never really thought about doing it that way but
>if I ... :-) :-) Currently, that is not in the works.

Actually, what might be better (from the point of view of trying to
collect lots of articles before bothering the MAIL: people) is to parse
a batch file.  For example, I have the following (C-news) sys file entry:

    maps:comp.mail.maps/all:f:

This places the file name of each article in comp.mail.maps into a batch
file, and I have a cron entry that runs a script that pulls each file name
out, unpacks it, calls pathalias and sends mail to me.  [I'm sending
it off to comp.sources.misc tonight]

This allows you to schedule when rkive runs, and it isn't dependent
on expiration (much).  Receiving on stdin could be rather unpleasant
w.r.t. performance at times....

The other main deficiency that I discovered so far is that you try
*so* hard to ensure that the .cf file is correct, that you don't allow
some additional niceties.  For example, each entry in "MAIL:" is
verified by calls to getpwnam().  That means that all three entries
will fail validation:

	MAIL: eci386!clewis, clewis@eci386, alias-in-global-alias-file
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425

kent@ssbell.UUCP (Kent Landfield) (07/12/89)

First off, sorry for the delay in getting this response out. I finally
took a *real* vacation. First in at least 3 years... :-) Anyway......

In article <1129@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes:
# I already looked in the code and saw that it would be quite easy
# to add an archive method that popen's a user program (specified in some
# way) that puts a plain filename on stdout.
# Than I can play with SysV date +%... as much as I like.
# The next step could be to let that program generating a relative
# pathname, in that case the program could be just a script with
# 	date '+volume%y/%h/%y%m%d'
# Only the problem with the .seq suffix should be solved then,
# if you *want* to solve it.

I see that the "hook" is apparent to you as well... :-)

In article <520@ssbell.UUCP> I wrote:
# The test as to whether an article is already archived is done by checking 
# if the archive file exists. I'm not sure what you mean by BIG. ...

Jos Vos writes:
# I only meant that the previous item (generating own filenames) would
# cause that you can't detect an article --- filename-for-saving relation
# anymore. Doesn't that problem also occur in your proposed scheme?
# What about the .archived file mentioned somewhere in the documentation?

The problem does not occur with the Chronological archiving. As for the
.archived file, it is used to indicate what the status is of the articles
currently on the system waiting to be expired. If an article is expired, the
entry in the .archived file is removed. In this way the .archived file for
the newsgroup is self-maintaining (i.e. it does not grow out of bounds) and
is *not* a log of all previous "article number - archive resting place" 
entries. The .archived file is used so that an article is only archived 
once, the first time rkive is run after the article reaches the system.
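
The self-maintaining behaviour described above can be modelled with a short Python sketch. The on-disk format of the .archived file is not given in this thread, so the sketch assumes a simple mapping from article number to archive location; both function names are hypothetical.

```python
def prune_archived(entries, spool_articles):
    """Drop entries for articles that have since been expired.

    entries: dict mapping article number -> archive location
             (assumed in-memory form of the .archived file).
    spool_articles: set of article numbers still in the news spool.
    """
    return {num: path for num, path in entries.items()
            if num in spool_articles}


def needs_archiving(entries, spool_articles):
    """Articles in the spool that have no .archived entry yet."""
    return sorted(num for num in spool_articles if num not in entries)


if __name__ == "__main__":
    entries = {101: "volume89/Jul/890706.01", 102: "volume89/Jul/890706.02"}
    spool = {102, 103}              # 101 expired, 103 newly arrived
    print(prune_archived(entries, spool))
    print(needs_archiving(entries, spool))
```

Because expired entries are pruned, the file tracks only what is still in the spool and is not a permanent history.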

I wrote:
# Currently, crosspostings are *not* handled. ...

Jos Vos writes:
# It becomes a problem if you want to use rkive for archiving a lot of
# newsgroups, i.e. not only sources. But I can imagine it's quite
# difficult to handle that correctly (w.r.t. the rkive.cf file's lists).

No, it really will not be too hard since I just need to check if the
group the article is crossposted to is also being archived and act
accordingly. At this point it just has not been done because I have
not been able to find the time. It is currently not a problem as long
as you have disk space. :-) :-) You will not lose any articles, just
duplicate them. :-(

Jos Vos writes:
#  Is it not possible to use rkive as a program directly
#  from the sys file (that is, with the article as stdin)?

I wrote:
# No. rkive is meant to run from cron and not receive the articles from stdin.

Jos Vos writes:
# Still a nice item for the IDEAS file :-)

Granted. The only problem that I see here is that I need some information
about where to archive, modes, owners etc that could make the command line
real ugly... :-) It can be worked but it is not real high on the priority
list. It has been put into the IDEAS file for now...

			-Kent+
---
Kent Landfield               UUCP:     kent@ssbell
Sterling Software FSG/IMD    INTERNET: kent@ssbell.uu.net
1404 Ft. Crook Rd. South     Phone:    (402) 291-8300 
Bellevue, NE. 68005-2969     FAX:      (402) 291-4362

kent@ssbell.UUCP (Kent Landfield) (07/12/89)

In article <1989Jul7.022708.4826@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
# Actually, what might be better (from the point of view of trying to
# collect lots of articles before bothering the MAIL: people) is to parse
# a batch file.  For example, I have the following (C-news) sys file entry:
# 
#     maps:comp.mail.maps/all:f:
# 
# Which places the file name of each article in comp.mail.maps, and I
# have a cron entry that runs a script that pulls each file name out
# and unpacks it, calls pathalias and sends mail to me.  
# 
# This allows you to schedule when rkive runs, and it isn't dependent
# on expiration (much).  Receiving on stdin could be rather unpleasant
# w.r.t. performance at times....

Ok. I think I am being dumb here (it would not be a first) but I don't see 
how this is really any different than what rkive does now. I can schedule 
rkive to run via cron any time I wish and with as much frequency. The 
difference is that this would get the file names from a different file/stdin 
whereas the current rkive gets the file names from the news directory 
structure. You are still dependent on expire since the file specified 
must still exist in both the current and this approach when it is time to 
"rkive" the file.  Like I said, I am probably just missing the point.

# The other main deficiency that I discovered so far is that you try
# *so* hard to ensure that the .cf file is correct, that you don't allow
# some additional niceties.  For example, each entry in "MAIL:" is
# verified by calls to getpwnam().  That means that all three entries
# will fail validation:
# 	MAIL: eci386!clewis, clewis@eci386, alias-in-global-alias-file

Yep. Try to check too little and you fall in a black hole. Try to check
too much and you choke off flexible usage. Somewhere in between is a path
on which you can breathe... I made this a compile time decision by adding
an ifdef around the getpwnam call.
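
To make the trade-off concrete, here is a Python sketch of a MAIL: line check with the password-file lookup made optional. It is an analogy only: rkive does this in C with getpwnam() and an ifdef, while the verify_users flag, the function names and the injectable lookup below are assumptions for illustration.

```python
import pwd


def local_user_exists(name):
    """True if `name` is in the local password database."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def validate_mail_line(value, verify_users=True, lookup=local_user_exists):
    """Split a MAIL: value and return (accepted, rejected) addresses.

    verify_users plays the role of the compile-time switch: when False,
    bang paths, user@host forms and aliases pass through untouched.
    """
    entries = [e.strip() for e in value.split(",") if e.strip()]
    if not verify_users:
        return entries, []
    accepted = [e for e in entries if lookup(e)]
    rejected = [e for e in entries if not lookup(e)]
    return accepted, rejected


if __name__ == "__main__":
    line = "eci386!clewis, clewis@eci386, alias-in-global-alias-file"
    print(validate_mail_line(line))
    print(validate_mail_line(line, verify_users=False))
```

With the check enabled, none of the three example entries is a local login name, so all of them are rejected; with it disabled, delivery is left to the mailer.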

			-Kent+
---
Kent Landfield               UUCP:     kent@ssbell
Sterling Software FSG/IMD    INTERNET: kent@ssbell.uu.net
1404 Ft. Crook Rd. South     Phone:    (402) 291-8300 
Bellevue, NE. 68005-2969     FAX:      (402) 291-4362

jos@idca.tds.PHILIPS.nl (Jos Vos) (07/12/89)

In article <521@ssbell.UUCP> kent@ssbell.UUCP (ssbell Admin) writes:

>The problem does not occur with the Chronological archiving. As for the
>.archived file, it is used to indicate what the status is of the articles
>currently on the system waiting to be expired. ...

A more general way of registering the archived articles is the
combination of the Message-Id and the Posting-Date.
That's quite unique *forever*, I hope :-)
(Any other suggestions are welcome, but I couldn't find anything else).

Crosspostings, feeding via stdin etc. are then quite easy to
handle. If you divide your databases into parts (e.g. a file
89 for a full year's database, or 8901, 8902, ... for monthly databases)
according to the posting dates you can easily check whether an article
was already archived in the stone age (i.e. before 1-1-1970 :-) ).

At expire time, it's still possible (if you want that) to 
minimize the database file(s) according to the expired articles.
If you don't want that, you'll get a really useful history.
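
A minimal in-memory sketch of this proposal, in Python, could look like the following. Nothing here comes from rkive or from an actual implementation; the class name, the monthly bucket naming (taken from the 8901, 8902, ... example) and the use of sets instead of files are all assumptions.

```python
from datetime import date


class ArchiveIndex:
    """Remembers archived articles by Message-ID, bucketed by posting month."""

    def __init__(self):
        self.buckets = {}           # "8907" -> set of Message-IDs

    @staticmethod
    def bucket_name(posted):
        # Monthly database name in the YYMM form suggested above.
        return "%02d%02d" % (posted.year % 100, posted.month)

    def seen(self, message_id, posted):
        return message_id in self.buckets.get(self.bucket_name(posted), set())

    def record(self, message_id, posted):
        """Return True if newly recorded, False if already archived."""
        if self.seen(message_id, posted):
            return False
        self.buckets.setdefault(self.bucket_name(posted), set()).add(message_id)
        return True


if __name__ == "__main__":
    index = ArchiveIndex()
    mid = "<1123@ssp15.idca.tds.philips.nl>"
    print(index.record(mid, date(1989, 7, 5)))   # first sighting
    print(index.record(mid, date(1989, 7, 5)))   # crosspost or re-feed
```

Since the key does not depend on the newsgroup or the article number, a crossposted or re-fed article is recognized no matter how it arrives.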

I'll work it out in more detail (I think I need it anyway and I
probably will implement it for my own private use) and post it.
Then you and The NET can decide what to do with it.

-- 
-- ######   Jos Vos   ######   Internet   jos@idca.tds.philips.nl   ######
-- ######             ######   UUCP         ...!mcvax!philapd!jos   ######

clewis@eci386.uucp (Chris Lewis) (07/13/89)

In article <523@ssbell.UUCP> kent@ssbell.UUCP (Kent Landfield) writes:
>In article <1989Jul7.022708.4826@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
># Actually, what might be better (from the point of view of trying to
># collect lots of articles before bothering the MAIL: people) is to parse
># a batch file.  For example, I have the following (C-news) sys file entry:
># 
>#     maps:comp.mail.maps/all:f:
># 
># Which places the file name of each article in comp.mail.maps, and I
># have a cron entry that runs a script that pulls each file name out
># and unpacks it, calls pathalias and sends mail to me.  
>
>Ok. I think I am being dumb here (it would not be a first) but I don't see 
>how this is really any different then what rkive does now. I can schedule 
>rkive to run via cron any time I wish and with as much frequency. The 
>difference is that this would get the file names from a different file/stdin 
>where as the current rkive gets the file names from the news directory 
>structure. You are still dependent on expire since the file specified 
>must still exist in both the current and this approach when it is time to 
>"rkive" the file.  Like I said, I am probably just missing the point.

Thinking more on it, the expire argument is probably bogus, but:

The main advantage is that you don't have to rummage around in the directory,
possibly parse the files, and check your database to see whether you've
already unpacked it.  You know that every single file listed in the batch
is new and you've not seen it before.  In fact, with this approach you 
*NEVER* have to have rkive reread its own databases or scan directories - 
the index files are merely logs of what things rkive's already snarfed, and
the batch file is names of files that rkive hasn't read yet.  Though, of 
course you do have to be fairly careful not to clobber things if they 
reappear, and you have to read the control file to decide what to do with 
each one.
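
As an illustration of the batch-file approach, the following Python sketch turns the lines of such a batch file into a work list. The exact line format is an assumption: the sketch takes the first whitespace-separated field of each line as the article's file name and ignores anything after it. The function name is hypothetical.

```python
def batch_filenames(lines):
    """Return the article file names listed in a batch file, in order.

    Blank lines are skipped and a name listed twice is returned once,
    which guards against the "reappearing article" case mentioned above.
    """
    seen = set()
    names = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        name = fields[0]            # assumed: file name comes first
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


if __name__ == "__main__":
    sample = [
        "comp/mail/maps/101 2345\n",
        "\n",
        "comp/mail/maps/102 1987\n",
        "comp/mail/maps/101 2345\n",
    ]
    print(batch_filenames(sample))
```

No directory scan or lookup in a database of previously seen articles is needed; the batch file itself is the list of unseen articles.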

[This discussion is probably moot because you've already implemented
a "fancy" version - what really bugs me is the map unpackers that people
write that go into the comp.mail.maps directory and run pathalias *only*
on what's in comp.mail.maps.  Missing expired entries, getting duplicate 
copies of maps (when you don't have supersede or someone goofed), and
being unable to compress the map files.  And, chances are, running as
root and someone put a trojan into one of the maps...]

[re: MAIL: destination checking]
>... I made this a compile time decision by adding
>an ifdef around the getpwnam call.

Oops, I musta missed that somewhere.
-- 
Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc.
UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis
Phone: (416)-595-5425

kent@ssbell.UUCP (Kent Landfield) (07/14/89)

In article <1989Jul7.022708.4826@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
# Actually, what might be better (from the point of view of trying to
# collect lots of articles before bothering the MAIL: people) is to parse
# a batch file.  For example, I have the following (C-news) sys file entry:
# 
#     maps:comp.mail.maps/all:f:
# 
# Which places the file name of each article in comp.mail.maps, and I
# have a cron entry that runs a script that pulls each file name out
# and unpacks it, calls pathalias and sends mail to me.  

In article <1989Jul13.160050.3478@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
# The main advantage is that you don't have to rummage around in the directory,
# possibly parse the files, and check your database to see whether you've
# already unpacked it.  You know that every single file listed in the batch
# is new and you've not seen it before.  In fact, with this approach you 
# *NEVER* have to have rkive reread its own databases or scan directories - 
# the index files are merely logs of what things rkive's already snarfed, and
# the batch file is names of files that rkive hasn't read yet.  

Now I see what you meant. With this approach the .archived file could become
history.. :-) Currently, rkive reads the directory to get a file name and
then reads the .archived file to see if it has already been archived.
If the filename is not found in the .archived file, the file is archived. With
your approach, I would not need to do that check...  Sorry, I'm just a little
slow sometimes... :-)

In article <523@ssbell.UUCP> I wrote:
# I made this a compile time decision by adding
# an ifdef around the getpwnam call.

Chris writes:
>Oops, I musta missed that somewhere.

No you didn't. That change is in the patch being posted today.

			-Kent+
---
Kent Landfield               UUCP:     kent@ssbell
Sterling Software FSG/IMD    INTERNET: kent@ssbell.uu.net
1404 Ft. Crook Rd. South     Phone:    (402) 291-8300 
Bellevue, NE. 68005-2969     FAX:      (402) 291-4362