jos@idca.tds.PHILIPS.nl (Jos Vos) (07/05/89)
After reading quickly through the accompanying documentation of rkive, I have the following remarks (please don't flame if an answer can be found in the documentation - I read it *quickly*): - First of all, it looks GREAT. I hope to start using it soon! Besides sources I want (have) to use it for a number of "normal" newsgroups too. - It is mentioned in rkive(1) that an existing file is (by default) not overwritten. What happens then? This area of problems is also indicated in the IDEAS file: a more flexible naming scheme should be possible besides the article number (and the other two). E.g. a format string using % notations for time parameters (day, hour, seconds). Also enabling the use of a user program that generates the filename (without the directory) would be a possibility: this is the most generic way and quite easy to implement (but not efficient...). - How it is known whether an article is already archived? The previous problem becomes BIG if it can only be concluded that an article is already archived because the file exists... - How are crosspostings handled? - Is it not possible to use rkive as a program directly from the sys file (that is, with the article as stdin)? Probably not (the first problem SHOULD be solved then). I think this is a much cleaner way of archiving the news, isn't it? (who knows what happens with /usr/spool/news before tonight :-)) I know I could find the answer of some questions in the code, but I didn't have time to look at that now. And besides that, much people (?) will sooner or later have the same questions. -- -- ###### Jos Vos ###### Internet jos@idca.tds.philips.nl ###### -- ###### ###### UUCP ...!mcvax!philapd!jos ######
kent@ssbell.UUCP (Kent Landfield) (07/06/89)
In article <1123@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes:
>After reading quickly through the accompanying documentation of rkive,
>I have the following remarks (please don't flame if an answer can be
>found in the documentation - I read it *quickly*):
>
>- First of all, it looks GREAT. I hope to start using it soon!
>  Besides sources I want (have) to use it for a number of "normal"
>  newsgroups too.
>
>- It is mentioned in rkive(1) that an existing file is (by default)
>  not overwritten. What happens then?

rkive handles it differently depending on whether the article is a REPOST
or not. If rkive detects that the destination (or target file) name exists
and the article is a ....

NON-REPOST Article:
    Whenever a duplicate is encountered, rkive creates a problems
    directory (if necessary) as specified in the PROBLEMS line of the
    rkive.cf configuration file. It then stores the inbound article in
    the problems directory within a subdirectory that reflects the name
    of the newsgroup the duplicate was found in. The archive
    administrator(s) specified in the rkive.cf are mailed a message
    indicating what has occurred. The original in the archive is not
    overwritten. The duplicate then becomes a matter of manual cleanup.

REPOST Article:
    Depending on how the software is compiled... REPOSTS are currently
    handled in one of three ways. In all three methods the archive
    administrator is notified of the occurrence via e-mail.

    MV_ORIGINAL
        The original article is placed (moved) into a subdirectory in
        the problems directory named "Originals". The inbound reposted
        article is then placed into the archive in the correct position.
        (My favorite..:-))

    ADD_REPOST_SUFFIX
        If ADD_REPOST_SUFFIX is defined, all reposts will have the string
        specified in the REPOST_SUFFIX define appended to the archive
        filename so that a repost of elm/part07 would appear in the
        archive as elm/part07-repost prior to any compression.
        (Careful with this one folks..)
No Reposting Defines specified: The inbound article would be placed into the archive in the correct position only if the initial article is not in the archive. Otherwise the reposted article is placed in the problems directory as a normal duplicate article is now. > This area of problems is also indicated in the IDEAS file: a > more flexible naming scheme should be possible besides the > article number (and the other two). E.g. a format > string using % notations for time parameters (day, hour, seconds). > Also enabling the use of a user program that generates the filename > (without the directory) would be a possibility: this is the most > generic way and quite easy to implement (but not efficient...). The IDEAS file describes the need for an alternate way to archive newsgroups that do not support the auxiliary headers. This is necessary since the Article-Number method uses the "news subsystem" naming scheme. If a news system numbering was restarted from scratch or the entire archive was moved to a different machine, problems could occur due to the potential for duplicate filenames. This is *not* something that you do everyday but it is a problem that *can* be avoided. A patch is in testing right now to be released next week that has an additional method of archiving. Chronological archiving support has been added which allows articles to be archived in a format of... volumeYY/MOY/YYMMDD.II or volumeYY/YYMMDD.II where YY - two digit year, MOY - Jun, Jul etc (table configurable), MM - two digit month DD - two digit day II - daily issue number which represents the number of the article in the order of processing. example: volume89/Jul/890706.01 or volume89/890706.01 I agree a generic hook is needed for the actual storage vehicle so as to support new methods over distributed media. That is in the works although *any* and *all* ideas are welcome and encouraged... >- How it is known whether an article is already archived? 
>
>  The previous problem becomes BIG if it can only be concluded
>  that an article is already archived because the file exists...

The test as to whether an article is already archived is done by checking
if the archive file exists. I'm not sure what you mean by BIG. I have been
running rkive since February and I have not moved my archive to another
machine or restarted my News numbering once. :-) (Wait till I put up
Cnews though :-))

Please remember, this archiver was initially designed as a sources archiver.
I have added the Chronological method which solves the problems of restarting
the news system and moving the archive that could have been a problem with
Article-Number archiving. You can now archive non-sources groups just as
effectively as sources groups. Well, as soon as the patch is posted next
week.. :-)

>- How are crosspostings handled?

Currently, crosspostings are *not* handled. rkive archives the newsgroups
that you specify in the rkive.cf configuration file. It blindly ignores
crosspostings and worries only about the target newsgroup. What does this
mean? If you have specified that you wish to archive comp.sources.unix and
comp.sources.d and the monthly informational posting goes out, you will
currently get *two* copies..... This is a recognized deficiency. It needs to
check to see if any of the crossposted groups are being archived as well and
attempt to link the files. I say attempt since my archives here at ssbell
reside on 4 different filesystems and as soon as I finish the distributed
version, they will be scattered on as many machines. :-)

>- Is it not possible to use rkive as a program directly
>  from the sys file (that is, with the article as stdin)?
>  Probably not (the first problem SHOULD be solved then).

No. rkive is meant to run from cron and not receive the articles from stdin.
To be quite honest, I never really thought about doing it that way but
if I ... :-) :-) Currently, that is not in the works.
> I think this is a much cleaner way of archiving the news, isn't it? > (who knows what happens with /usr/spool/news before tonight :-)) On my machines, I know... :-) >I know I could find the answer of some questions in the code, but >I didn't have time to look at that now. And besides that, much >people (?) will sooner or later have the same questions. Please, ask away! I *expected* that I would be answering questions. Better sooner than later. I have been receiving some *GREAT* ideas from the net as to ways to improve and enhance rkive's functionality. Thanks! I answer my mail so if you have not gotten an answer back, I probably didn't get it. I am planning on posting the patch to comp.sources.bugs and sending a copy as well to rich. Distributed archiving is next on my list. Also the "random software downloader" for retrieving complete packages, patches and all, is in development. Anyone want to help me name the "random software downloader" ? get is already taken and rsd sounds so bland.. :-) :-) >-- ###### Jos Vos ###### Internet jos@idca.tds.philips.nl ###### Thanks Jos! -Kent+ --- Kent Landfield UUCP: kent@ssbell Sterling Software FSG/IMD INTERNET: kent@ssbell.uu.net 1404 Ft. Crook Rd. South Phone: (402) 291-8300 Bellevue, NE. 68005-2969 FAX: (402) 291-4362
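The chronological layout Kent describes in this post (volumeYY/MOY/YYMMDD.II or volumeYY/YYMMDD.II) can be sketched roughly as follows. This is only an illustration in Python, not rkive's own C code; the function name chrono_path, the by_month switch, and the default month table are assumptions made for the example (Kent only says the month table is configurable).

```python
from datetime import date

# Default month abbreviations. The post says this table is configurable,
# so treat this particular tuple as an assumption.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def chrono_path(day, issue, by_month=True):
    """Build volumeYY/MOY/YYMMDD.II (or volumeYY/YYMMDD.II).

    day   -- date the article is being archived under
    issue -- daily issue number, i.e. the article's position in the
             order of processing for that day
    """
    yy = day.strftime("%y")                          # two digit year
    name = "%s.%02d" % (day.strftime("%y%m%d"), issue)
    if by_month:
        return "volume%s/%s/%s" % (yy, MONTHS[day.month - 1], name)
    return "volume%s/%s" % (yy, name)


print(chrono_path(date(1989, 7, 6), 1))          # volume89/Jul/890706.01
print(chrono_path(date(1989, 7, 6), 1, False))   # volume89/890706.01
```

Because the name depends only on the date and the per-day counter, it survives a restart of the news system's article numbering, which is the problem with Article-Number archiving raised above.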
jos@idca.tds.PHILIPS.nl (Jos Vos) (07/06/89)
In article <520@ssbell.UUCP> kent@ssbell.UUCP (Kent Landfield) writes: >In article <1123@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes: >A patch is in testing right now to be released next week that has an >additional method of archiving. .... > .... >example: > volume89/Jul/890706.01 or volume89/890706.01 I already looked in the code and saw that it would be quite easy to add an archive method that popen's a user program (specified in some way) that puts a plain filename on stdout. Than I can play with SysV date +%... as much as I like. The next step could be to let that program generating a relative pathname, in that case the program could be just a script with date '+volume%y/%h/%y%m%d' Only the problem with the .seq suffix should be solved then, if you *want* to solve it. >The test as to whether an article is already archived is done by checking >if the archive file exists. I'm not sure what you mean by BIG. ... I only meant that the previous item (generating own filenames) would cause that you can't detect an article --- filename-for-saving relation anymore. Doesn't that problem also occur in your proposed scheme? What about the .archived file mentioned somewhere in the documentation? >>- How are crosspostings handled? >Currently, crosspostings are *not* handled. ... It becomes a problem if you want to use rkive for archiving a lot of newsgroups, i.e. not only sources. But I can imagine it's quite difficult to handle that correctly (w.r.t. the rkive.cf file's lists). >>- Is it not possible to use rkive as a program directly >> from the sys file (that is, with the article as stdin)? >No. rkive is meant to run from cron and not receive the articles from stdin. Still a nice item for the IDEAS file :-) -- -- ###### Jos Vos ###### Internet jos@idca.tds.philips.nl ###### -- ###### ###### UUCP ...!mcvax!philapd!jos ######
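The hook Jos proposes here (popen a user program that prints a plain filename on stdout) might look something like the sketch below. It is written in Python for brevity rather than as a patch to rkive; filename_from_program is a hypothetical name, and the check that rejects a "/" reflects only the first step Jos describes (a filename without the directory), not the later relative-pathname variant or the .seq suffix question.

```python
import subprocess
import sys


def filename_from_program(argv):
    """Run the user's naming program and return the plain filename it prints."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    name = result.stdout.strip()
    # Step one of the proposal: a plain filename, no directory part.
    if not name or "/" in name:
        raise ValueError("expected a plain filename, got %r" % name)
    return name


if __name__ == "__main__":
    # Stand-in for a user script; a real one might simply wrap `date +%y%m%d`.
    namer = [sys.executable, "-c", "print('890706')"]
    print(filename_from_program(namer))
```

Spawning a process per article is the inefficiency Jos mentions in his first post; the flexibility comes from the archiver knowing nothing about how the name is chosen.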
clewis@eci386.UUCP (07/07/89)
I've just started up rkive and like it. Thanks. But I do have a few preliminary comments - one already touched on: In article <520@ssbell.UUCP> kent@ssbell.UUCP (Kent Landfield) writes: >In article <1123@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes: >>- Is it not possible to use rkive as a program directly >> from the sys file (that is, with the article as stdin)? >> Probably not (the first problem SHOULD be solved then). >No. rkive is meant to run from cron and not receive the articles from stdin. >To be quite honest, I never really thought about doing it that way but >if I ... :-) :-) Currently, that is not in the works. Actually, what might be better (from the point of view of trying to collect lots of articles before bothering the MAIL: people) is to parse a batch file. For example, I have the following (C-news) sys file entry: maps:comp.mail.maps/all:f: Which places the file name of each article in comp.mail.maps, and I have a cron entry that runs a script that pulls each file name out and unpacks it, calls pathalias and sends mail to me. [I'm sending it off to comp.sources.misc tonight] This allows you to schedule when rkive runs, and it isn't dependent on expiration (much). Receiving on stdin could be rather unpleasant w.r.t. performance at times.... The other main deficiency that I discovered so far is that you try *so* hard to ensure that the .cf file is correct, that you don't allow some additional niceties. For example, each entry in "MAIL:" is verified by calls to getpwnam(). That means that all three entries will fail validation: MAIL: eci386!clewis, clewis@eci386, alias-in-global-alias-file -- Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc. UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis Phone: (416)-595-5425
kent@ssbell.UUCP (Kent Landfield) (07/12/89)
First off, sorry for the delay in getting this response out. I finally
took a *real* vacation. First in at least 3 years... :-) Anyway......

In article <1129@ssp15.idca.tds.philips.nl> jos@idca.tds.PHILIPS.nl (Jos Vos) writes:
# I already looked in the code and saw that it would be quite easy
# to add an archive method that popen's a user program (specified in some
# way) that puts a plain filename on stdout.
# Than I can play with SysV date +%... as much as I like.
# The next step could be to let that program generating a relative
# pathname, in that case the program could be just a script with
#     date '+volume%y/%h/%y%m%d'
# Only the problem with the .seq suffix should be solved then,
# if you *want* to solve it.

I see that the "hook" is apparent to you as well... :-)

In article <520@ssbell.UUCP> I wrote:
# The test as to whether an article is already archived is done by checking
# if the archive file exists. I'm not sure what you mean by BIG. ...

Jos Vos writes:
# I only meant that the previous item (generating own filenames) would
# cause that you can't detect an article --- filename-for-saving relation
# anymore. Doesn't that problem also occur in your proposed scheme?
# What about the .archived file mentioned somewhere in the documentation?

The problem does not occur with the Chronological archiving. As for the
.archived file, it is used to indicate what the status is of the articles
currently on the system waiting to be expired. If an article is expired,
the entry in the .archived file is removed. In this way the .archived file
for the newsgroup is self-maintaining (i.e. it does not grow out of bounds)
and is *not* a log of all previous "article number - archive resting place"
entries. The .archived file is used so that an article is only archived
once, the first time rkive is run after the article reaches the system.

I wrote:
# Currently, crosspostings are *not* handled. ...
Jos Vos writes: # It becomes a problem if you want to use rkive for archiving a lot of # newsgroups, i.e. not only sources. But I can imagine it's quite # difficult to handle that correctly (w.r.t. the rkive.cf file's lists). No, it really will not be too hard since I just need to check if the group the article is crossposted to is also being archived and act accordingly. At this point it just has not been done because I have not been able to find the time. It is currently not a problem as long as you have disk space. :-) :-) You will not lose any articles, just duplicate them. :-( Jos Vos writes: # Is it not possible to use rkive as a program directly # from the sys file (that is, with the article as stdin)? I wrote: # No. rkive is meant to run from cron and not receive the articles from stdin. Jos Vos writes: # Still a nice item for the IDEAS file :-) Granted. The only problems that I see here is that I need some information about where to archive, modes, owners etc that could make the command line real ugly... :-) It can be worked but it is not real high on the priority list. It has been put into the IDEAS file for now... -Kent+ --- Kent Landfield UUCP: kent@ssbell Sterling Software FSG/IMD INTERNET: kent@ssbell.uu.net 1404 Ft. Crook Rd. South Phone: (402) 291-8300 Bellevue, NE. 68005-2969 FAX: (402) 291-4362
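The crosspost check Kent outlines in this post (see whether any of the other groups an article is posted to are also being archived, and act accordingly) could be sketched along these lines. This is an illustration in Python, not rkive code; crosspost_plan is a hypothetical helper, and choosing the first archived group as the one that holds the real copy is an assumption of the sketch.

```python
def crosspost_plan(newsgroups_header, archived_groups):
    """Decide where a possibly crossposted article should be stored.

    newsgroups_header -- value of the article's Newsgroups: header,
                         e.g. "comp.sources.unix,comp.sources.d"
    archived_groups   -- set of group names being archived (in rkive
                         these would come from rkive.cf)

    Returns (primary, others): the group that gets the real copy and the
    groups that should only get a link to it, or (None, []) when none of
    the article's groups is archived.
    """
    groups = [g.strip() for g in newsgroups_header.split(",") if g.strip()]
    targets = [g for g in groups if g in archived_groups]
    if not targets:
        return None, []
    return targets[0], targets[1:]


archived = {"comp.sources.unix", "comp.sources.d"}
print(crosspost_plan("comp.sources.unix,comp.sources.d", archived))
```

As Kent notes in his earlier reply, linking can only be attempted: hard links do not work across filesystems, so a real implementation would need a fallback (a copy or a symbolic link) when the archives live on different filesystems or machines.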
kent@ssbell.UUCP (Kent Landfield) (07/12/89)
In article <1989Jul7.022708.4826@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
# Actually, what might be better (from the point of view of trying to
# collect lots of articles before bothering the MAIL: people) is to parse
# a batch file. For example, I have the following (C-news) sys file entry:
#
# maps:comp.mail.maps/all:f:
#
# Which places the file name of each article in comp.mail.maps, and I
# have a cron entry that runs a script that pulls each file name out
# and unpacks it, calls pathalias and sends mail to me.
#
# This allows you to schedule when rkive runs, and it isn't dependent
# on expiration (much). Receiving on stdin could be rather unpleasant
# w.r.t. performance at times....
Ok. I think I am being dumb here (it would not be a first) but I don't see
how this is really any different than what rkive does now. I can schedule
rkive to run via cron any time I wish and with as much frequency. The
difference is that this would get the file names from a different file/stdin
whereas the current rkive gets the file names from the news directory
structure. You are still dependent on expire since the file specified
must still exist in both the current and this approach when it is time to
"rkive" the file. Like I said, I am probably just missing the point.
# The other main deficiency that I discovered so far is that you try
# *so* hard to ensure that the .cf file is correct, that you don't allow
# some additional niceties. For example, each entry in "MAIL:" is
# verified by calls to getpwnam(). That means that all three entries
# will fail validation:
# MAIL: eci386!clewis, clewis@eci386, alias-in-global-alias-file
Yep. Try to check too little and you fall in a black hole. Try to check
too much and you choke off flexible usage. Somewhere in between is a path
on which you can breathe... I made this a compile time decision by adding
an ifdef around the getpwnam call.
-Kent+
---
Kent Landfield UUCP: kent@ssbell
Sterling Software FSG/IMD INTERNET: kent@ssbell.uu.net
1404 Ft. Crook Rd. South Phone: (402) 291-8300
Bellevue, NE. 68005-2969 FAX: (402) 291-4362
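The trade-off discussed in this post (validating MAIL: entries against the local password file versus allowing bang paths, user@host addresses and aliases) can be illustrated with a small sketch. rkive makes this a compile-time choice with an ifdef around getpwnam(); the Python below uses a run-time flag instead, purely for illustration. The names invalid_recipients, check_local and lookup are assumptions of the example, and the pwd module is only available on Unix-like systems.

```python
import pwd


def invalid_recipients(mail_value, check_local=True, lookup=pwd.getpwnam):
    """Return the MAIL: entries that fail the local-user check.

    mail_value  -- text after "MAIL:", a comma separated list of recipients
    check_local -- when False, skip validation entirely (the analogue of
                   compiling rkive without the getpwnam check)
    lookup      -- function that raises KeyError for unknown users;
                   pwd.getpwnam behaves this way
    """
    names = [n.strip() for n in mail_value.split(",") if n.strip()]
    if not check_local:
        return []
    bad = []
    for name in names:
        try:
            lookup(name)
        except KeyError:
            bad.append(name)
    return bad


entries = "eci386!clewis, clewis@eci386, alias-in-global-alias-file"
print(invalid_recipients(entries, check_local=False))   # nothing is rejected
```

With the check enabled, all three of Chris's example entries would normally be reported, since none of them is a login name in the password file.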
jos@idca.tds.PHILIPS.nl (Jos Vos) (07/12/89)
In article <521@ssbell.UUCP> kent@ssbell.UUCP (ssbell Admin) writes:
>The problem does not occur with the Chronological archiving. As for the
>.archived file, it is used to indicate what the status is of the articles
>currently on the system waiting to be expired. ...

A more general way of registering the archived articles is the combination
of the Message-Id and the Posting-Date. That's quite unique *forever*,
I hope :-) (Any other suggestions are welcome, but I couldn't find
something else). Crosspostings, feeding via stdin etc. are then quite
easy to handle.

If you divide your databases into parts (e.g. a file 89 for a full year's
database, or 8901, 8902, ... for monthly databases) according to the
posting dates you can easily check whether an article is already archived
in the stone age (i.e. before 1-1-1970 :-) ).

At expire time, it's still possible (if you want that) to minimize the
database file(s) according to the expired articles. If you don't want it,
you'll get a really useful history.

I'll work it out in more detail (I think I need it anyway and I probably
will implement it for my own private use) and post it. Then you and
The NET can decide what to do with it.

--
-- ######   Jos Vos   ######   Internet   jos@idca.tds.philips.nl   ######
-- ######             ######   UUCP       ...!mcvax!philapd!jos     ######
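Jos's idea of keying the record of archived articles on Message-Id plus posting date, split into per-month databases, could be sketched like this. It is an in-memory illustration only; ArchiveIndex and its methods are hypothetical names, and a real version would keep each part in a file named after the month (8901, 8902, ...) as he suggests.

```python
from collections import defaultdict
from datetime import date


class ArchiveIndex:
    """Remember archived articles by (posting month, Message-Id)."""

    def __init__(self):
        # Maps a part name such as "8907" to the Message-Ids recorded in it.
        self._parts = defaultdict(set)

    @staticmethod
    def part_name(posted):
        """Name of the monthly database an article belongs to (YYMM)."""
        return posted.strftime("%y%m")

    def seen(self, message_id, posted):
        """True if this article was already recorded as archived."""
        return message_id in self._parts.get(self.part_name(posted), ())

    def record(self, message_id, posted):
        """Record an article; return False if it was already present."""
        part = self._parts[self.part_name(posted)]
        if message_id in part:
            return False
        part.add(message_id)
        return True


index = ArchiveIndex()
print(index.record("<1123@ssp15.idca.tds.philips.nl>", date(1989, 7, 5)))
print(index.record("<1123@ssp15.idca.tds.philips.nl>", date(1989, 7, 5)))
```

Because the key does not involve the local article number or the target filename, a crossposted copy or an article fed on stdin maps to the same entry, which is why Jos says those cases become easy to handle.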
clewis@eci386.uucp (Chris Lewis) (07/13/89)
In article <523@ssbell.UUCP> kent@ssbell.UUCP (Kent Landfield) writes: >In article <1989Jul7.022708.4826@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes: ># Actually, what might be better (from the point of view of trying to ># collect lots of articles before bothering the MAIL: people) is to parse ># a batch file. For example, I have the following (C-news) sys file entry: ># ># maps:comp.mail.maps/all:f: ># ># Which places the file name of each article in comp.mail.maps, and I ># have a cron entry that runs a script that pulls each file name out ># and unpacks it, calls pathalias and sends mail to me. > >Ok. I think I am being dumb here (it would not be a first) but I don't see >how this is really any different then what rkive does now. I can schedule >rkive to run via cron any time I wish and with as much frequency. The >difference is that this would get the file names from a different file/stdin >where as the current rkive gets the file names from the news directory >structure. You are still dependent on expire since the file specified >must still exist in both the current and this approach when it is time to >"rkive" the file. Like I said, I am probably just missing the point. Thinking more on it, the expire argument is probably bogus, but: The main advantage is that you don't have to rummage around in the directory, possibly parse the files, and check your database to see whether you've already unpacked it. You know that every single file listed in the batch is new and you've not seen it before. In fact, with this approach you *NEVER* have to have rkive reread its own databases or scan directories - the index files are merely logs of what things rkive's already snarfed, and the batch file is names of files that rkive hasn't read yet. Though, of course you do have to be fairly careful not to clobber things if they reappear, and you have to read the control file to decide what to do with each one. 
[This discussion is probably moot because you've already implemented a "fancy" version - what really bugs me is the map unpackers that people write that go into the comp.mail.maps directory and runs pathalias *only* on what's in comp.mail.maps. Missing expired entries, getting duplicate copies of maps (when you don't have supercede or someone goofed), and being unable to compress the map files. And, chances are, running as root and someone put a trojan into one of the maps...] [re: MAIL: destination checking] >... I made this a compile time decision by adding >an ifdef around the getpwnam call. Oops, I musta missed that somewhere. -- Chris Lewis, R.H. Lathwell & Associates: Elegant Communications Inc. UUCP: {uunet!mnetor, utcsri!utzoo}!lsuc!eci386!clewis Phone: (416)-595-5425
kent@ssbell.UUCP (Kent Landfield) (07/14/89)
In article <1989Jul7.022708.4826@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
# Actually, what might be better (from the point of view of trying to
# collect lots of articles before bothering the MAIL: people) is to parse
# a batch file. For example, I have the following (C-news) sys file entry:
#
# maps:comp.mail.maps/all:f:
#
# Which places the file name of each article in comp.mail.maps, and I
# have a cron entry that runs a script that pulls each file name out
# and unpacks it, calls pathalias and sends mail to me.

In article <1989Jul13.160050.3478@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
# The main advantage is that you don't have to rummage around in the directory,
# possibly parse the files, and check your database to see whether you've
# already unpacked it. You know that every single file listed in the batch
# is new and you've not seen it before. In fact, with this approach you
# *NEVER* have to have rkive reread its own databases or scan directories -
# the index files are merely logs of what things rkive's already snarfed, and
# the batch file is names of files that rkive hasn't read yet.

Now I see what you meant. With this approach the .archived file could
become history.. :-) Currently, rkive reads the directory to get a
file name and then reads the .archived file to see if it has already
been archived. If the filename is not found in the .archived file, the
file is archived. With your approach, I would not need to do that check...
Sorry, I'm just a little slow sometimes... :-)

In article <523@ssbell.UUCP> I wrote:
# I made this a compile time decision by adding
# an ifdef around the getpwnam call.

Chris writes:
>Oops, I musta missed that somewhere.

No you didn't. That change is in the patch being posted today.

-Kent+
---
Kent Landfield UUCP: kent@ssbell
Sterling Software FSG/IMD INTERNET: kent@ssbell.uu.net
1404 Ft. Crook Rd. South Phone: (402) 291-8300
Bellevue, NE. 68005-2969 FAX: (402) 291-4362
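The difference between the two approaches compared in this post can be put side by side in a short sketch. It is illustrative Python, not rkive code; both function names are hypothetical, and the inputs are plain lists standing in for a directory listing, the contents of the .archived file, and the batch file written by the news system.

```python
def new_by_scan(spool_names, archived_names):
    """Current approach: list the newsgroup directory, then drop every
    name that already appears in the .archived file."""
    done = set(archived_names)
    return [name for name in spool_names if name not in done]


def new_by_batch(batch_lines):
    """Batch approach: the news system appends the path of each arriving
    article to a batch file, so every listed path is new by construction
    and no lookup in .archived is needed."""
    return [line.strip() for line in batch_lines if line.strip()]


print(new_by_scan(["101", "102", "103"], ["101", "102"]))
print(new_by_batch(["comp/mail/maps/103\n"]))
```

In the batch variant the caller would still have to empty or rotate the batch file after each run so that the same paths are not processed twice, and, as Chris points out, take care not to clobber things if an article reappears.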