kent@ssbell.UUCP (Kent Landfield) (06/02/89)
                      USENET Sources Archiver

Tired of archiving sources by hand ??

Tired of missing postings to comp.sources.all just 'cause you finally
took a day off ?

Can't remember the filename that you saved those posted sources to ?

Not too sure how long you will be able to hold that 'old archiver'
together before the tires give out ?

Ever wonder how people archive their comp.sources.all newsgroups ?

Where are all these automated source archivers that you see mentioned
on the net ? Where are they available from ? What do they do ? How can
you *get* one ?

If you answered yes to any of the above, don't feel alone. There are
many people out there with the same problems. I know 'cause I was one
of them......

It had gotten to the point of frustration. I had to do something!
Shutting down the newsfeed was not a possibility, I was addicted...
Running

    find /usr/spool/news/comp/sources -type f -print | xargs rm -f

from cron just so I didn't see them was the coward's way out.

I thought I had the answer, I'd ask the net! "Hey buddy, got an
archiver ?"

    Week 1, three mail messages came in..
    Week 2, four mail messages came in..
    Week 3, one mail message came in..

And then the messages stopped coming.......

Did the received messages hold the answer .... ?

    "Uh, if you have any luck, could you please send the info on
     to me ? InadvanceThanks. I sure could use it too ya know.."

I was at my wits' end! Users were asking for posted sources that had
just passed through our machine and I DIDN'T HAVE IT.... Jumping was
starting to look real good...

Finally, on the verge of breakdown, I had an attack of logic.

    "I'm looking for a program. I am a programmer. If I can not find
     the program I want, I could write it!"

For the next few days, I walked around the office mumbling
encouragements to myself. People started to look at me funny... My
secretary went to visit my boss after a week. Shortly thereafter I
heard, "Kent, get in here!"
After settling down in the rock hard folding chair in front of his
desk, I could see he was disturbed.

"Kent, Donna tells me that you seem to have a problem and that you are
wandering around all day saying 'I drink a can, I drink a can.' What
are you, drunk ?"

"Only with an obsession sir", I responded.

"What is that, some sort of imported beer ?" he asked.

"No sir, what I have been saying is 'I think I can'."

"Well just what is it that you think you can do ?"

"Write a piece of code." I *knew* that I had made a mistake there...

"WHAT! You are a programmer, right? You better be able to write a
piece of code. What the hell do I pay you for anyway?"

I didn't answer that, my foot was squarely in my mouth. I *really*
didn't want it shoved down my throat...

"Now get out of here and go write that code!!" he said, pointing his
fat finger at the door.

"THANK YOU sir!" I said as I backed out of the room.

The rest, as they say, is history....

So much for the **fiction** :-) :-)

NOW for the FACTS :-)

Last night I sent the rkive USENET Sources Archiver to Rich Salz for
posting to comp.sources.unix. It should appear in a spool directory
near you *real* soon now. (Is there a catch-22 here, need an archiver
to catch the sources to an archiver so you can archive the archiver
sources :-) )

The rkive package was initially designed for archiving
comp.sources.all newsgroups. It does, however, support archiving of
non-moderated, non-sources newsgroups.

What follows is a *long* explanation of just what it is and what it
does. (This might be one to send to the printer and read later.. :-))
You have been warned! :-)

------------------------------------
 rkive - A USENET Sources Archiver
------------------------------------

rkive is used to archive the USENET sources groups to an alternate
location as specified in an rkive configuration file.
Archives can be maintained in one of three ways:

Archive-Name -
    The moderators of *most* sources groups assign an official
    Archive-Name to each article that gets submitted to the net.
    Each archived file is then stored under a name such as
    "new-login" or "elm/part06". For multi-part postings, a
    subdirectory is created (as indicated in the elm example) to
    hold the separate "parts". This format is used by many large
    archive sites because it makes retrieval via mail request
    software such as netlib easier and the filenames give hints as
    to what the software is.

Volume-Issue -
    Software sent via *most* moderated groups has an assigned Volume
    and Issue number. This allows the moderators to track and
    reference the individual items that have been posted to the
    group. Each individual article is given an "Issue" number. The
    Issues are grouped together into a "Volume". There are roughly
    100 articles in each Volume but this is an arbitrary split
    totally up to the moderator. This format is extremely useful
    when the software archives are cataloged. It makes searching of
    the files quicker and verification of complete volumes easier.
    This archive format is recommended for any site that will be
    doing *massive* searches of the individual volumes since it
    keeps the quadratic nature of directory searches from making
    your life miserable.

Article Number -
    The news software stores the articles locally by naming each
    news article with a number generated on every site. The Article
    Number ordering is unique to each site. If an Article Number
    archive is requested (or required by the newsgroup), the news
    article file is copied to the directory specified in the archive
    configuration file. The name of the archived article will match
    the original name generated by the news software.

By means of a configuration file, the archive administrator is able
to control how archiving is performed.
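To make the three archive layouts described above concrete, here is a
small Python sketch of how an archived member's path could be derived
in each mode. It is not rkive's own code. The function name
archive_path, the mode strings, and the "volumeNN" and "vNNiNNN"
patterns are assumptions, inferred only from the example labels
(volume23/rkive/patch01, volume23/v23i014) that appear in the PATCHES
section of this posting.

```python
import posixpath


def archive_path(mode, volume=None, issue=None, archive_name=None,
                 article_number=None):
    """Return a member path relative to the newsgroup's archive directory.

    Hypothetical helper: the naming patterns are guesses based on the
    examples quoted in the posting, not on the rkive sources.
    """
    if mode == "archive-name":
        # The moderator's Archive-name, e.g. "new-login" or "elm/part06".
        # A multi-part name carries its own subdirectory ("elm/").
        return posixpath.join("volume%d" % volume, archive_name)
    if mode == "volume-issue":
        # e.g. Volume 23, Issue 14 -> volume23/v23i014
        member = "v%02di%03d" % (volume, issue)
        return posixpath.join("volume%d" % volume, member)
    if mode == "article-number":
        # Keep the name the local news software gave the article file.
        return str(article_number)
    raise ValueError("unknown archive mode: %r" % (mode,))
```

Keeping the volume as a directory level in the first two modes is what
bounds the size of any one directory, which is the point made above
about directory searches.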
The administrator can specify on a per newsgroup basis:

    o The type of the archiving, such as Volume-Issue, Archive-Name,
      or Article Number archiving,
    o Where the newsgroup archive is to be stored on disk,
    o The location of the log file for the newsgroup,
    o The format of the log file records,
    o The location of the index file for the newsgroup,
    o The format of the index file records,
    o A list of users to be sent mail when an article is archived,
    o The owner/group and modes of each archived member,
    o Whether the archived members should be compressed or not,
    o How to deal with REPOSTs to archived members, and
    o How to deal with patches to posted sources (only in newsgroups
      that support the Patch-To: line).

It is intended that rkive be run by cron on a daily basis. In this
manner, the sources are archived and available for retrieval from the
archives on the day they reach the machine instead of having to wait
for expire -a to run. It allows for the archives to be managed by the
same or different people (or accounts). It supports the building of
indexes for later review or to interface to the netlib type of mail
retrieval software. It also supports mailing notifications of the
archiving to a specified list of users or aliases. The index and log
file record formats are specifiable by the person configuring the
rkive configuration file.

----------------
REPOST Handling:
----------------

ADD_REPOST_SUFFIX
    This define allows the administrator to configure the software
    to add "-repost" (or whatever is defined in REPOST_SUFFIX) to
    the end of all files that are marked as REPOST by the newsgroup
    moderator. The suffix is added prior to compression. This
    feature should only be configured on systems whose filename
    limits are greater than 14 characters.

MV_ORIGINAL
    This define allows the administrator to configure the software
    to move the original article into an "originals" directory in
    the problems directory. The inbound reposted article is placed
    into the archive in the correct position.
If neither define is specified then the inbound article is placed
into the archive in the correct position only if the initial article
is not in the archive. Otherwise the reposted article is placed in
the problems directory, as normal duplicate articles are now.

-----------------
PATCHES Handling:
-----------------

rkive supports the new auxiliary header "Patch-To:" that is going to
be used in comp.sources.unix. The Patch-To: line exists for articles
that are patches to previously posted software. It only appears in
articles that are posted, "Official", patches. The initial postings
do not contain the Patch-To: auxiliary header line.

Auxiliary Headers For Patch Postings:

       Submitted-by: Kent Landfield <kent@ssbell.UUCP>
       Posting-number: Volume 23, Issue 14
    -> Patch-To: Volume 22, Issue 122
       Archive-name: rkive/patch1

There are two different types of handling with regards to patches.

Package -
    This type of archiving places the patches in the same directory
    that the initial source was posted to. It is only available to
    newsgroup archives that are using Archive-Name archiving as
    well.

Historical -
    This type of archiving is done by sites that want to place the
    patches in the volume/issue directory as specified by the
    moderator when the patch was initially posted.

rkive recognizes that the Patch-To: line indicates the article is a
patch. For Archive-Name archiving which has specified "Package"
patches archiving in the configuration file, rkive puts the article
into the directory that contained the initial posting
(volume22/rkive). For Archive-Name archiving that has not specified
Package archiving, or for Volume/Issue archiving, the article would
still be labeled as volume23/rkive/patch01 or volume23/v23i014
respectively.

rkive also writes a .patchlog file in the BASEDIR for the newsgroup
that is used to track patches to originally posted software.
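The Package versus Historical placement rules above can be sketched
in a few lines of Python. This is only an illustration under stated
assumptions, not rkive's implementation: parse_volume_issue and
patch_destination are made-up names, the mode strings are invented,
and the "volumeNN" / "vNNiNNN" patterns are inferred from the labels
quoted above.

```python
import posixpath
import re

# Matches header values such as "Volume 23, Issue 14".
_VOL_ISSUE = re.compile(r"Volume\s+(\d+),\s*Issue\s+(\d+)")


def parse_volume_issue(value):
    """Turn a Posting-number: or Patch-To: value into (volume, issue)."""
    match = _VOL_ISSUE.search(value)
    if match is None:
        raise ValueError("no volume/issue in %r" % (value,))
    return int(match.group(1)), int(match.group(2))


def patch_destination(mode, package_patches, archive_name,
                      posting_number, patch_to):
    """Pick the archive path for an article carrying a Patch-To: line.

    Hypothetical helper; path patterns are assumptions (see lead-in).
    """
    volume, issue = parse_volume_issue(posting_number)
    initial_volume, _initial_issue = parse_volume_issue(patch_to)
    if mode == "archive-name":
        # Package: file the patch beside the initial posting.
        # Historical: leave it in the volume it was posted in.
        target = initial_volume if package_patches else volume
        return posixpath.join("volume%d" % target, archive_name)
    if mode == "volume-issue":
        # Package handling is not available here, so the flag is ignored.
        member = "v%02di%03d" % (volume, issue)
        return posixpath.join("volume%d" % volume, member)
    raise ValueError("unknown archive mode: %r" % (mode,))
```

The initial issue number is parsed but not used for placement; it only
matters for bookkeeping such as the patch log.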
The .patchlog file is going to be used for the "random software
downloader :-)" so that complete software packages (sources and
patches) can be requested from sites that do not use combined
Archive-Name and Package archiving. The format of the .patchlog file
is:

    #
    #       Patchlog for comp.sources.whoknows
    #
    # Path To       Initial  Initial  Current  Current
    # Patchfile     Volume   Issue    Volume   Issue
    #
    bb/patch01      22       105      23       77

or if volume issue format..

    v47i022         22       105      23       77

-------------------------
Article Header Reduction:
-------------------------

Articles that are stored just as they arrived on your system are
potentially wasting disk space. Certain rfc822/rfc1036 header lines
are of little use after the article is archived. If you wish to have
the headers "trimmed" when the file is archived, make sure that
REDUCE_HEADERS is defined. Currently all header lines other than
From:, Newsgroups:, Subject:, Message-ID:, and Date: will be removed.
This can produce a savings of as much as 200 to 500 bytes per
archived article.

---------
Security:
---------

rkive sets the ownership, group and modes on the archived members
according to the information specified in the configuration file.
Currently though, rkive uses the default umask for creating the log
and index files.

rkive will not archive files outside of the BASEDIR specified in the
configuration file, so a "prankster" can not do nasty things to your
system files by having an Archive-name line like:

    Archive-name: ../../../../../../etc/passwd

It will also not overwrite duplicate files. They are stored
underneath the problems directory specified in the configuration
file. The admin is alerted to the fact and it then becomes a manual
cleanup problem.

-------------------------------------------------
article - Format News Article Header Information
-------------------------------------------------

The accompanying article program allows you to view the article
headers in much the same manner that you use a printf statement.
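As a rough illustration of the printf-style idea, the Python sketch
below reads the header block of an article and fills a format string
from it. The real format specifiers accepted by the article program
are not described in this posting, so the sketch simply borrows
Python's own "%(Header-Name)s" notation; parse_headers and
format_article are hypothetical names.

```python
def parse_headers(text):
    """Collect header lines, up to the first blank line, into a dict."""
    headers = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        if line[0] in " \t" and last is not None:
            # Folded continuation line: append to the previous header.
            headers[last] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        if sep:
            last = name.strip()
            headers[last] = value.strip()
    return headers


def format_article(fmt, text):
    """Fill a '%(Header-Name)s' style format from an article's headers.

    Raises KeyError if the format names a header the article lacks.
    """
    return fmt % parse_headers(text)
```

A call such as format_article("%(Archive-name)s  %(Subject)s", text)
would then produce one index-style line per article, with header
names matched exactly as they are spelled in the article.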
The article program was initially written for debugging purposes but
I quickly found that it was *extremely* useful in dealing with news
articles in general. It works great in shell scripts to view articles
that need to be read.... Also super for perusing the archives
directly and generating indexes to the archives in *many* different
ways...:-)

--------
CREDITS:
--------

I have to give credit where credit is due 'cause they earned it!

I used the code in header.c of News 2.11 as the basis of ideas for
dealing with the article headers. The code I have written is not the
same but most of the concepts and some of the flow control resulted
from reviewing how it was "supposed to be done". (rfcs only go so
far.. :-)) For that I thank Rick Adams and the authors of news for
the *excellent* code to study from.. :-)

I would also like to thank my beta testers for the headaches of
dealing with me, for forcing different ideas on me at a time when I
was "almost" willing to listen :-) and for the many different "full
redistribution of sources" every time I had a new version.
Specifically I want to thank eric@amperif (Eric Johnson) and
denny@mcmi (Denny Page) for putting up with me.. :-) and
rick@ssbell (Rick Ohnemus) for his excellent debugging
techniques.. :-)

------------------------------------------------------------------------

This software set was developed under an archiving model similar to
that maintained currently on uunet. It was intended that the
archiving facilities be more of a "site" facility and not an
individual's facility. (That is unless the individual owned the
site :-)).

I have not tried to use rkive for maintaining a private (many dups on
a single machine) archive. There does not seem to be any reason why
it would not work. It just hasn't been done. rkive will accept an
rkive.cf file specified on the command line so it would be possible
for an individual to have their own mini archive directory structure.
This is *not* recommended if the site is doing archiving since the
software will store multiple copies thus wasting more disk space than
it is worth. Aside from that, if someone does try it, let me know how
it turns out. :-) :-)

VERY IMPORTANT!

If you have a problem, there's someone else out there who either has
had or will have the same problem. Please send all patches, ideas,
etc to kent@ssbell (or uunet!ssbell!kent) so that I can continue to
improve the functionality and portability of this package.

                kent@ssbell.UUCP
                uunet!ssbell!kent