[comp.sources.d] USENET Sources Archiver

kent@ssbell.UUCP (Kent Landfield) (06/02/89)

                 USENET Sources Archiver             

Tired of archiving sources by hand ??  Tired of missing postings to 
comp.sources.all just 'cause you finally took a day off ?  Can't
remember the filename that you saved those posted sources to ? 
Not too sure how long you will be able to hold that 'old archiver'
together before the tires give out ?

Ever wonder how people archive their comp.sources.all newsgroups ?
Where are all these automated source archivers that you see mentioned
on the net ?  Where are they available from ?  What do they do ?  How 
can you *get* one ?

If you answered yes to any of the above, don't feel alone.  There are
many people out there with the same problems.  I know 'cause I was
one of them......  

It had gotten to the point of frustration.  I had to do something!  
Shutting down the newsfeed was not a possibility, I was addicted...  
Running 

    find /usr/spool/news/comp/sources -type f -print | xargs rm -f

from cron just so I didn't see them was the coward's way out. 
I thought I had the answer, I'd ask the net! 

             "Hey buddy, got an archiver ?"

Week 1, three mail messages came in..
Week 2, four mail messages came in..
Week 3, one mail message came in..
And then the messages stopped coming.......

Did the received messages hold the answer .... ? 

         "Uh, if you have any luck, could you please send
         the info on to me ?  InadvanceThanks.  I sure could
         use it too ya know.."

I was at my wits' end!  Users were asking for posted sources that had just
passed through our machine and I DIDN'T HAVE THEM....  Jumping was starting
to look real good...

Finally, on the verge of breakdown, I had an attack of logic.
"I'm looking for a program.  I am a programmer. If I can not find the
program I want, I could write it!"  For the next few days, I walked around the
office mumbling encouragements to myself.  People started to look at me
funny...   My secretary went to visit my boss after a week.  Shortly
thereafter I heard, "Kent, get in here!"   After settling down in the
rock hard folding chair in front of his desk, I could see he was disturbed.
"Kent, Donna tells me that you seem to have a problem and that you are
wandering around all day saying 'I drink a can, I drink a can.'  What are
you, drunk ?"  "Only with an obsession sir", I responded.  "What is that,
some sort of imported beer ?" he asked.  "No sir, what I have been saying
is 'I think I can'."  "Well just what is it that you think you can do ? "
"Write a piece of code."  I *knew* that I had made a mistake there... 
"WHAT! you are a programmer right?  You better be able to write a piece of 
code.  What the hell do I pay you for anyway?"  I didn't answer that, my 
foot was squarely in my mouth.  I *really* didn't want it shoved down my 
throat...  "Now get out of here and go write that code!!" he said pointing 
his fat finger at the door.  "THANK YOU sir!" I said as I backed out of the 
room.   The rest, as they say, is history....

So much for the **fiction** :-) :-) NOW for the FACTS :-)

Last night I sent the rkive USENET Sources Archiver to Rich Salz for 
posting to comp.sources.unix.  It should appear in a spool directory
near you *real* soon now.  (Is there a catch-22 here, need an archiver to
catch the sources to an archiver so you can archive the archiver 
sources :-) )

The rkive package was initially designed for archiving comp.sources.all 
newsgroups.  It does, however, support archiving of non-moderated, 
non-sources newsgroups.  What follows is a *long* explanation of just 
what it is and what it does. (This might be one to send to the printer
and read later.. :-))  You have been warned! :-) 
  
          ------------------------------------
          rkive    - A USENET Sources Archiver
          ------------------------------------

rkive is used to archive the USENET sources groups to an alternate 
location as specified in an rkive configuration file.  Archives can be 
maintained in one of three ways:

     Archive-Name - The moderators of *most* sources groups assign
         an official Archive-Name to each article that gets submitted
         to the net.  In this manner, each file has a "new-login" or
         "elm/part06" type of format.  For multi-part postings, a sub-
         directory is created (as indicated in the elm example) to
         hold the separate "parts".  This format is used by many large
         archive sites because it is easier for retrieval via mail
         request software such as netlib and the filenames give hints
         as to what the software is.

     Volume-Issue - Software sent via *most* moderated groups has
         an assigned Volume and Issue number.  This allows the modera-
         tors to track and reference the individual items that have
         been posted to the group.  Each individual article is given
         an "Issue" number.  The Issues are grouped together into a
         "Volume".  There are roughly 100 articles in each Volume but
         this is an arbitrary split totally up to the moderator.
         This format is extremely useful when the software archives
         are cataloged.  It makes searching of the files quicker and
         verification of complete volumes easier.  This archive format
         is recommended for any site that will be doing *massive*
         searches of the individual volumes since it keeps the qua-
         dratic nature of directory searches from making your life
         miserable.

     Article Number - The news software stores the articles
         locally by naming the news article by a number generated on
         every site.  The Article Number ordering is unique to each
         site.  If an Article Number archive is requested (or required
         by the newsgroup), the news article file is copied to the
         directory specified in the archive configuration file.  The
         name of the archived article will match the original name
         generated by the news software.
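
The three layouts above can be sketched roughly as follows.  This is only
an illustration in Python, not rkive's own code: the function name, the
mode strings and the zero padding are assumptions modeled on the
"volume23/v23i014" and "volume23/rkive/patch01" examples that appear
later in this posting.

```python
import re

# Matches an auxiliary header value such as "Volume 23, Issue 14".
_POSTING = re.compile(r"Volume (\d+), Issue (\d+)")

def archive_path(mode, headers, article_number):
    """Return a path relative to the newsgroup's archive directory.

    mode is one of "archive-name", "volume-issue" or "article-number"
    (hypothetical spellings chosen for this sketch).
    """
    if mode == "article-number":
        # Keep the name the news software gave the article on this site.
        return str(article_number)

    match = _POSTING.match(headers.get("Posting-number", ""))
    if match is None:
        raise ValueError("article has no usable Posting-number header")
    volume, issue = int(match.group(1)), int(match.group(2))

    if mode == "volume-issue":
        # Assumed padding: Volume 23, Issue 14 becomes v23i014.
        return "volume%d/v%02di%03d" % (volume, volume, issue)
    if mode == "archive-name":
        # Multi-part postings such as "elm/part06" get a sub-directory.
        return "volume%d/%s" % (volume, headers["Archive-name"])
    raise ValueError("unknown archive mode: %r" % mode)
```

With the headers "Posting-number: Volume 23, Issue 14" and
"Archive-name: rkive/patch1", the sketch yields volume23/v23i014 for
Volume-Issue archiving and volume23/rkive/patch1 for Archive-Name
archiving.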

By means of a configuration file, the archive administrator is able to 
control how archiving is performed.  The administrator can specify on a
per newsgroup basis:

   o The type of archiving: Volume-Issue, Archive-Name, or
     Article Number,
   o Where the newsgroup archive is to be stored on disk,
   o The location of the log file for the newsgroup,
   o The format of the log file records,
   o The location of the index file for the newsgroup,
   o The format of the index file records,
   o A list of users to be sent mail when an article is archived,
   o The owner/group and modes of each archived member,
   o Whether the archived members should be compressed or not,
   o How to deal with REPOSTs to archived members, and
   o How to deal with patches to posted sources (only in
     newsgroups that support the Patch-To: line).
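
To make the list above concrete, here is one way the per-newsgroup
settings could be modeled.  This is a Python sketch only; it does not show
the real rkive.cf syntax, and every field name, default and path below is
made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GroupConfig:
    """Hypothetical per-newsgroup archive settings (not the rkive.cf format)."""
    newsgroup: str
    archive_type: str             # "volume-issue", "archive-name" or "article-number"
    basedir: str                  # where the newsgroup archive lives on disk
    log_file: str = ""            # location of the log file
    log_format: str = ""          # format of the log file records
    index_file: str = ""          # location of the index file
    index_format: str = ""        # format of the index file records
    mail_to: List[str] = field(default_factory=list)  # users to notify
    owner: str = ""               # owner of each archived member
    group: str = ""               # group of each archived member
    mode: int = 0o444             # modes of each archived member
    compress: bool = False        # compress archived members or not
    package_patches: bool = False # store patches with the initial posting

# Example values are invented for this sketch.
csu = GroupConfig(
    newsgroup="comp.sources.unix",
    archive_type="archive-name",
    basedir="/usenet/comp.sources.unix",
    mail_to=["archive-admin"],
    compress=True,
)
```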

It is intended that rkive be run by cron on a daily basis.  In this manner,
the sources are archived and available for retrieval from the archives on the
day they reach the machine instead of having to wait for expire -a to run.
It allows for the archives to be managed by the same or different people 
(or accounts).  It supports the building of indexes for later review or to 
interface to the netlib type of mail retrieval software.  It also supports 
mailing notifications of the archiving to a specified list of users or 
aliases.  The indexes and log file record formats are specifiable by the 
person configuring the rkive configuration file.

----------------
REPOST Handling:
----------------
ADD_REPOST_SUFFIX 
    This define allows the administrator to configure the software to
    add "-repost" (or whatever is defined in REPOST_SUFFIX) to the
    end of all files that are marked as REPOST by the newsgroup moderator.
    The suffix is added prior to compression.  This feature should only be 
    configured/exist on systems whose filename limits are greater than 14 characters.

MV_ORIGINAL
    This define allows the administrator to configure the software to
    move the original article into an "originals" directory in the 
    problems directory.  The inbound reposted article is placed into 
    the archive in the correct position.

If neither define is specified then the inbound article is placed into 
the archive in the correct position only if the initial article is not 
in the archive.  Otherwise the reposted article is placed in the problems 
directory as normal duplicate articles are now.
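
The decision described above can be sketched like this.  It is a Python
illustration, not the rkive source: the function name, the "problems" and
"problems/originals" paths and the returned tuples are assumptions.  The
text does not say what happens when both defines are set, so the sketch
simply checks the suffix option first.

```python
def place_repost(name, archived, add_repost_suffix=False, mv_original=False,
                 repost_suffix="-repost"):
    """Return (action, path) steps for an article marked as a REPOST.

    name     -- archive path the article would normally get
    archived -- set of paths already present in the archive
    """
    if add_repost_suffix:
        # ADD_REPOST_SUFFIX: store the repost beside the original.
        return [("archive", name + repost_suffix)]
    if mv_original:
        # MV_ORIGINAL: set the old copy aside, repost takes its place.
        steps = []
        if name in archived:
            steps.append(("move", "problems/originals/" + name))
        steps.append(("archive", name))
        return steps
    if name not in archived:
        # Neither define: fill the gap if the initial article is missing.
        return [("archive", name)]
    # Otherwise treat it like any other duplicate.
    return [("archive", "problems/" + name)]
```

For example, with neither option set and "elm/part06" already archived,
the repost ends up under problems/elm/part06.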

-----------------
PATCHES Handling:
-----------------
rkive supports the new Auxiliary header "Patch-To:" that is going to
be used in c.s.u.  The Patch-To: line exists for articles that are patches 
to previously posted software.  The Patch-To: line only appears in articles 
that are posted as "Official" patches.  The initial postings do not contain 
the Patch-To: auxiliary header line.

Auxiliary Headers For Patch Postings:

     Submitted-by: Kent Landfield <kent@ssbell.UUCP>
     Posting-number: Volume 23, Issue 14
->   Patch-To: Volume 22, Issue 122
     Archive-name: rkive/patch1

There are two different types of handling with regards to patches. 

 Package     - This type of archiving of patches places the patches
               in the same directory that the initial source was
               posted to.  This type of archiving is only available
               to newsgroup archives that are using Archive-Name
               archiving as well.
                   
 Historical  - This type of archiving patches is done by sites that 
               want to place the patches in the volume/issue 
               directory as specified by the moderator when the patch
               was initially posted.

rkive recognizes that the Patch-To: line indicates the article is 
a patch.  For Archive-Name archiving which has specified "Package" 
patches archiving in the configuration file, rkive puts the article 
into the directory that contained the initial posting (volume22/rkive). 
For Archive-Name that has not specified Package archiving or for 
Volume/Issue archiving, the article would still be labeled as
volume23/rkive/patch01 or volume23/v23i014 respectively.
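
A rough Python sketch of that choice, using the header values from the
example above.  The function name, argument names and path padding are
assumptions for illustration; the real logic lives in rkive itself and
may differ in detail.

```python
import re

# Matches values such as "Volume 22, Issue 122".
_VOL_ISSUE = re.compile(r"Volume (\d+), Issue (\d+)")

def patch_destination(headers, archive_type, package_patches):
    """Pick the archive path for an article, honoring Patch-To: if asked."""
    posted = _VOL_ISSUE.match(headers["Posting-number"])
    posted_vol, posted_issue = int(posted.group(1)), int(posted.group(2))

    if archive_type == "archive-name":
        if package_patches and "Patch-To" in headers:
            # Package: file the patch with the initial posting's volume.
            initial_vol = int(_VOL_ISSUE.match(headers["Patch-To"]).group(1))
            return "volume%d/%s" % (initial_vol, headers["Archive-name"])
        # Historical: keep the volume the patch was posted in.
        return "volume%d/%s" % (posted_vol, headers["Archive-name"])

    # Volume-Issue archives are always Historical.
    return "volume%d/v%02di%03d" % (posted_vol, posted_vol, posted_issue)

headers = {
    "Posting-number": "Volume 23, Issue 14",
    "Patch-To": "Volume 22, Issue 122",
    "Archive-name": "rkive/patch1",
}
```

Given those headers, Package archiving files the patch under volume22/rkive,
while the other two cases stay in volume23.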

rkive also writes a .patchlog file in the BASEDIR for the newsgroup
that is used to track patches to originally posted software.  The
.patchlog is going to be used for the "random software downloader :-)"
so that complete software packages (sources and patches) can be requested
from sites that do not use combined Archive-Name and Package archiving.
The format of the .patchlog file is:
#
# Patchlog for comp.sources.whoknows
#
# Path To         Initial  Initial     Current Current 
# Patchfile       Volume   Issue       Volume  Issue
#
bb/patch01          22     105           23    77
            or if volume issue format..
v47i022             22     105           23    77
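
For anyone who wants to read a .patchlog from another program, a minimal
reader might look like the sketch below.  It is Python, not part of the
rkive package, and it assumes only what the sample shows: "#" comment
lines and five whitespace-separated fields per record.  The record and
function names are made up.

```python
from collections import namedtuple

PatchRecord = namedtuple(
    "PatchRecord",
    "path initial_volume initial_issue current_volume current_issue",
)

def parse_patchlog(text):
    """Return a list of PatchRecord tuples from .patchlog contents."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split()
        if len(fields) != 5:
            continue                      # skip anything that is not a record
        path = fields[0]
        numbers = [int(f) for f in fields[1:]]
        records.append(PatchRecord(path, *numbers))
    return records

def patches_for(records, volume, issue):
    """Paths of all patches made to the posting at (volume, issue)."""
    return [r.path for r in records
            if (r.initial_volume, r.initial_issue) == (volume, issue)]
```

A downloader could call patches_for(records, 22, 105) to collect every
patch logged against the posting in Volume 22, Issue 105.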

-------------------------
Article Header Reduction:
-------------------------
Articles that are stored just as they arrived on your system are potentially
wasting disk space.  Certain RFC 822/RFC 1036 header lines are of little use
after the article is archived.  If you wish to have the headers "trimmed" 
when the file is archived, ensure that REDUCE_HEADERS is defined.  Currently 
all header lines that are *not* one of

    From:, Newsgroups:, Subject:, Message-ID:, and Date:

will be removed.  This can produce a savings of as much as 200 to 500 
bytes per archived article.
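
The effect of the trimming can be sketched in a few lines of Python.
This is an illustration of the idea only, not the code rkive uses; the
function name is invented, and dropping the continuation lines of a
removed header is an assumption about how folded headers should be
treated.

```python
# Header names kept by the reduction, compared case-insensitively.
KEEP = ("from", "newsgroups", "subject", "message-id", "date")

def reduce_headers(article):
    """Return the article with all but the KEEP header lines removed."""
    # The header block ends at the first blank line.
    head, sep, body = article.partition("\n\n")
    kept = []
    keeping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation of the previous header: follow its fate.
            if keeping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        keeping = name in KEEP
        if keeping:
            kept.append(line)
    return "\n".join(kept) + sep + body
```

The body of the article, including any auxiliary headers the moderator
placed there, is passed through untouched.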

---------
Security:
---------
rkive sets the ownership, group and modes on the archived members according
to the information specified in the configuration file.  Currently though,
rkive uses the default umask for creating the log and index files.

rkive will not archive files outside of the BASEDIR specified in the 
configuration file so a "prankster" cannot do nasty things to your
system files by having an Archive-name line like:
     Archive-name: ../../../../../../etc/passwd
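
A check of this kind can be sketched as follows.  This is Python and only
shows the general idea; how rkive itself performs the test is not
described here, and the function name and the /usenet path are made up.
Note that the comparison is purely textual and does not follow symbolic
links.

```python
import posixpath

def safe_archive_path(basedir, archive_name):
    """Return the full archive path, or None if it would leave basedir."""
    base = posixpath.normpath(basedir)
    # normpath collapses "..", "." and doubled slashes.
    candidate = posixpath.normpath(posixpath.join(base, archive_name))
    if candidate.startswith(base + "/"):
        return candidate
    return None            # outside BASEDIR (or BASEDIR itself): refuse
```

With a BASEDIR of /usenet/csu, the name "elm/part06" is accepted while
"../../../../../../etc/passwd" and "/etc/passwd" are both refused.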

It will also not overwrite duplicate files.  They are stored underneath
the problems directory specified in the configuration file.  The admin 
is alerted to the fact and it then becomes a manual cleanup problem.

        -------------------------------------------------
        article  - Format News Article Header Information
        -------------------------------------------------

The accompanying article program allows you to view the article headers in 
much the same manner that you use a printf statement.  This was initially 
done for debugging purposes but I quickly found that it was *extremely*
useful in dealing with news articles in general.  It works great in shell 
scripts to view articles that need to be read....  Also super for perusing 
the archives directly and generating indexes to the archives in *many* 
different ways...:-)

--------
CREDITS:
--------
I have to give credit where credit is due 'cause they earned it!

I used the code in header.c of the News 2.11 as the basis of ideas for
dealing with the article headers. The code I have written is not the same
but most of the concepts and some of the flow control resulted from reviewing
how it was "supposed to be done". (RFCs only go so far.. :-)) For that I
thank Rick Adams and the authors of news for the *excellent* code to study
from.. :-)

I would also like to thank my beta testers for the headaches of dealing
with me, for forcing different ideas on me at a time when I was "almost"
willing to listen :-) and for the many different "full redistribution of 
sources" every time I had a new version. Specifically I want to thank 
eric@amperif (Eric Johnson) and denny@mcmi (Denny Page) for putting up 
with me.. :-) and rick@ssbell (Rick Ohnemus) for his excellent debugging
techniques.. :-)
------------------------------------------------------------------------
This software set was developed under an archiving model similar to
that maintained currently on uunet. It was intended that the archiving
facilities be more of a "site" facility and not an individual's
facility. (That is unless the individual owned the site :-)). I have
not tried to use rkive for maintaining a private (many dups on a single machine)
archive. There does not seem to be any reason why it would not work. It
just hasn't been done. rkive will accept an rkive.cf file specified on the
command line so it would be possible for an individual to have their own
mini archive directory structure. This is *not* recommended if the site is
doing archiving since the software will store multiple copies thus wasting
more disk space than it is worth.  Aside from that, if someone does try it,
let me know how it turns out. :-) :-)

    VERY IMPORTANT! If you have a problem, there's someone else out there 
    who either has had or will have the same problem.  Please send all 
    patches, ideas, etc. to kent@ssbell (or uunet!ssbell!kent) so that I 
    can continue to improve the functionality and portability of this 
    package. 

			kent@ssbell.UUCP
			uunet!ssbell!kent