[comp.archives] Comp.archives database format

bill@twwells.uucp (T. William Wells) (10/25/88)

Contained herein is my first attempt at the structure of the database
that comp.archives is intended to feed. I am also going to describe
the comp.archives postings used to maintain the database. None of
this is cast in stone, and critiques are welcome.

Here is the example archive site entry from my previous message.
Following it is a line-by-line description.

NM twwells.UUCP
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
AD bill@twwells.UUCP (T. William Wells)
MA 781 W. Oakland Pk Blvd #208, Ft. Lauderdale FL 33311
CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp
DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.

NM twwells.UUCP

      This is the site name.

EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

AD bill@twwells.UUCP (T. William Wells)

      This is the person who is responsible for the archive.  He
      may or may not be the uucp, news, or system administrator.
      There can be more than one of these.

MA 781 W. Oakland Pk Blvd #208, Ft. Lauderdale FL 33311

      The mailing address for help or information.  Don't include
      this unless you want snail-mail.  People who mail to this
      address had better include a SASE or e-mail address or forget
      about getting any response.

CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp

      This contains the information needed to access the archive.
      There can be several of these, depending on how many ways
      your site can be accessed.

      Each line starts with a tag that identifies the access method.
      This is used when not all of your archived information is
      available through all paths to your site.  For example, you
      might have a mail based server for small items but require a
      direct link for larger things.  Each item that you list as
      available through your archive has a tag that is used to
      indicate which way it can be accessed.

      There may be more than one line for a single tag. This would
      mean that there is more than one way to get to the same set
      of information.

      The next field describes the access method.  This would be
      something like "uucp", or "ftp", or "mail", or whatever.

      The remaining fields depend on the access method.  Since I am
      only familiar with uucp, I am only going to describe the
      fields for it.  I definitely want input on what is necessary
      for other access methods.

      There are two fields for uucp access.  The first is the path
      name which archive file names are relative to. The second is
      an L.sys entry that would be used to access your site.
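
A minimal Python sketch of how a reader's tool might split the uucp CO
line above, assuming the colon-delimited layout shown in this draft.
The function name parse_co_v1 and the dictionary keys are hypothetical,
not part of the proposal.  Because the L.sys entry contains colons of
its own (the chat script), only the first three colons are treated as
delimiters.

```python
def parse_co_v1(line):
    """Split a first-draft uucp CO line into its fields.

    Assumed layout (taken from the example above):
        CO <tag>:<method>:<path>:<L.sys entry>
    """
    keyword, _, rest = line.partition(" ")
    if keyword != "CO":
        raise ValueError("not a CO line: %r" % line)
    # The L.sys entry contains colons of its own, so split at most 3 times.
    parts = rest.split(":", 3)
    if len(parts) != 4:
        raise ValueError("expected 4 colon-separated fields: %r" % line)
    tag, method, path, lsys = parts
    return {"tag": tag, "method": method, "path": path, "lsys": lsys}


example = r"CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp"
fields = parse_co_v1(example)
print(fields["tag"], fields["method"], fields["path"])
print(fields["lsys"])
```

A plain split on every colon would cut the chat script into pieces,
which is one argument for choosing a delimiter that cannot occur in an
L.sys entry.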

DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.

      This is a short description of your site. You might also
      include any special information about your archives; for
      example, if you are willing to make tapes you would say so
      here.

---

Here is a sample entry for the archived information database.  Note
that I made this up from a cursory examination of Pcomm; don't take
it as gospel.

NM unix-pcomm
VR version 1.1
AU egray@fthood.UUCP (Emmet P. Gray)
MA egray@fthood.UUCP (Emmet P. Gray)
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
TT public domain version of ProComm (TM)
KW all-source,public-domain,datacomm
SY any:modem,sysv-unix:termcaps,install
DE Pcomm is a public domain telecommunication program for Unix that
DE is designed to operate similar to the MSDOS program, ProComm.
DE ProComm (TM) is copyrighted by Datastorm Technologies, Inc.  This
DE is a completely new program and contains no ProComm source code.
DE This is not a Datastorm product.

Here is a line-by-line description:

NM unix-pcomm

      The name of the item.  If the item is a program that ports to
      one environment, the name is that environment hyphenated with
      the program name; otherwise it is just the name.  Note that
      this is not intended to be useful by itself, e.g., unix-pcomm
      might eventually also refer to something that has been made
      to work under VMS.  Should there be two items with the same
      name, the later item will have its author's name appended.
      For example, should John Turkey later write a pcomm for
      UNIX, it would be called unix-pcomm-turkey.

VR version 1.1

      Some kind of version stamp.  If the item does not have
      versions, this is the date released or published, or
      something else indicating when the item came into existence.

AU egray@fthood.UUCP (Emmet P. Gray)

      This is the person or persons who wrote the thing.  If there
      is more than one author, use more than one line.

MA egray@fthood.UUCP (Emmet P. Gray)

      This is who is maintaining the item.  If the item is not
      being maintained, don't add this line.  If several people are
      maintaining it, use several lines.  Note that anyone whose
      name is on one of these lines can expect e-mail about the
      item.

EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

TT public domain version of ProComm (TM)

      A title for the item.

KW all-source,public-domain,datacomm

      Keywords describing the item.  Note the `all-source' keyword,
      which means that all the source (other than that of the tools
      mentioned below) needed is included.  Note also the
      public-domain keyword, which indicates that the item is in
      the public domain.

SY any:modem,sysv-unix:termcaps,install

      For each system this item runs on (or must be used on),
      there should be one of these lines.  The fields are:

      1) The hardware it runs on.  If it runs on any hardware
	 which a particular OS runs on, the entry is `any'.
	 Required additional hardware is indicated by
	 :<hardware>.

      2) The OS it runs under.  There are several generic names
	 like the `sysv-unix' above.  Optional OS things which are
	 needed are indicated the same way hardware options are.
	 Also, software which is not listed in this directory which
	 is needed to make this go is listed here.  Multiple
	 entries are separated by semicolons.  For example, if this
	 is a Dbase-II program, you'd have MS-DOS;Dbase-II in this
	 field.

      3) How much effort is needed to make it go.  If following the
	 directions is sufficient, the entry is `install'.

      4) This entry contains any tools, not normally available on
	 your system, which one must have in order to build or use
	 this item. All items which are in this section must also
	 have their own entries in the information directory.

      There may be more than one of these lines, whenever necessary.

DE Pcomm is a public domain telecommunication program for Unix that
DE is designed to operate similar to the MSDOS program, ProComm.
DE ProComm (TM) is copyrighted by Datastorm Technologies, Inc.  This
DE is a completely new program and contains no ProComm source code.
DE This is not a Datastorm product.

      This is a short description of the item. This should be kept
      brief; putting the man page here is probably not appropriate.

Here is another entry that would go in the information database.

NM free-distribution-database
VR updated continuously
AU bill@twwells.UUCP (T. William Wells)
MA bill@twwells.UUCP (T. William Wells)
EN bill@twwells.UUCP (T. William Wells) 19880926
TT Database of freely distributable, electronically accessible information.
KW database,public-domain
SY any,any,install
DE This database is constructed from the information that passes
DE through comp.archives.  It contains information on any software,
DE databases, documents, or what-have-you, that is both freely
DE distributable and available electronically.  "Freely
DE distributable" means that, if you have a copy of the item, you
DE can (at least) make exact copies and give them away, and you
DE don't have to tell the owner of the item (if any) that you have
DE done so.  "Electronically available" means that it is either
DE accessible through a publicly accessible network, or is available
DE by a means that does not involve paying a fee to the
DE distributor.  This information is provided as a free service and
DE there is *no one* guaranteeing that any of it is accurate or
DE useful.  Use it at your own risk.

---

Here is the meat of the database: the index of things available from
each archive site.  This is the format:

archive-name;version;site-name;access-type;access-handle;date;tools;comments

      `Archive-name' and `version' match entries in the main
      database.  If this file is not in the database, leave the
      fields blank.  Note that this means that you can make
      available archive information about things not in the
      directory; however, this practice is discouraged.

      `Site-name' is the name of the site, as recorded in the site
      database.

      `Access-type' is one of the access tags specified in the site
      entry.  Note that this is in the style of UNIX file names:
      wild cards are permitted.

      `Access-handle' is used with the information from the site
      entry to construct the request from the archive.  For
      example, using uucp, if the site entry contained
      /usr/archives as the path to which file names are relative,
      and this field contains foobar.shar, then the path name you
      should use to get this item is /usr/archives/foobar.shar.

      `Date' is the date on which this entry was added to the
      database.

      `Tools' is a list of programs needed to unarchive the file;
      each must be a name in the info database.  Standard system
      utilities are not listed.

      `Comments' is anything useful to add.

For example, suppose I have pcomm sitting around in my directories.
I could have these records:

unix-pcomm;version 1.1;twwells;*;pcomm.1.shar.Z;1988 Oct 21;compress;part 1
unix-pcomm;version 1.1;twwells;*;pcomm.2.shar.Z;1988 Oct 21;compress;part 2
unix-pcomm;version 1.1;twwells;*;pcomm.3.shar.Z;1988 Oct 21;compress;part 3
unix-pcomm;version 1.1;twwells;*;pcomm.4.shar.Z;1988 Oct 21;compress;part 4
unix-pcomm;version 1.1;twwells;*;pcomm.5.shar.Z;1988 Oct 21;compress;part 5
unix-pcomm;version 1.1;twwells;*;pcomm.6.shar.Z;1988 Oct 21;compress;part 6
unix-pcomm;version 1.1;twwells;*;pcomm.7.shar.Z;1988 Oct 21;compress;part 7
unix-pcomm;version 1.1;twwells;*;pcomm.8.shar.Z;1988 Oct 21;compress;part 8
unix-pcomm;version 1.1;twwells;*;pcomm.p1.shar.Z;1988 Oct 21;compress;patch 1
unix-pcomm;version 1.1;twwells;*;pcomm.p2.shar.Z;1988 Oct 21;compress;patch 2
unix-pcomm;version 1.1;twwells;*;pcomm.p3.shar.Z;1988 Oct 21;compress;patch 3

This says that

    various pieces of unix-pcomm, version 1.1 are available from my site
    they can be accessed through any way that my site can be accessed
    the various pieces of it can be accessed with names beginning with pcomm
    the entries were added on October 21, 1988
    you need compress to unarchive any of it
    parts 1-8 and patches 1-3 are available

Now, suppose that I had a list of local BBS's that I was willing
to make available. It would have an entry like:

;;twwells;*;bbslist;2001 Jan 1;;bbs systems in south Florida

This says that the file bbslist is available but that it has no entry
in the information database.
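
As a rough illustration of the index format in this draft, the
following Python sketch splits a record into its eight fields and
builds the uucp path name from a site's base directory.  The names
IndexRecord, parse_index_line, and uucp_path are hypothetical, and the
/usr/archives directory is the one used in the `Access-handle'
description, not my real path.

```python
from collections import namedtuple

IndexRecord = namedtuple(
    "IndexRecord",
    "name version site access_type handle date tools comments",
)


def parse_index_line(line):
    """Split one index line into its eight semicolon-separated fields."""
    # Split at most 7 times so that a semicolon inside the free-form
    # comments field is tolerated (an assumption; the post does not say).
    parts = line.rstrip("\n").split(";", 7)
    if len(parts) != 8:
        raise ValueError("expected 8 fields: %r" % line)
    return IndexRecord(*parts)


def uucp_path(base_dir, record):
    """Join the site's CO path with the record's access handle."""
    return base_dir.rstrip("/") + "/" + record.handle


rec = parse_index_line(
    "unix-pcomm;version 1.1;twwells;*;pcomm.1.shar.Z;1988 Oct 21;compress;part 1"
)
unlisted = parse_index_line(
    ";;twwells;*;bbslist;2001 Jan 1;;bbs systems in south Florida"
)

print(uucp_path("/usr/archives", rec))
print(unlisted.name == "" and unlisted.version == "")
```

An item with no entry in the information database simply comes back
with empty name and version fields, so no special case is needed.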

---

That leaves the problem of how to distribute this database. Here
are my goals:

      1) To minimize the amount of information retransmitted
	 through the newsgroup. In an ideal world, the data would
	 get transmitted once, and everyone would thereafter query
	 archive sites for current copies.

      2) To minimize the delay in getting the information out.
	 This means avoiding batching the data; it would not be
	 very nice to hold some archive information just because no
	 one else was posting at that time.

      3) To minimize the pain of maintaining a database from the
	 information which flows through comp.archives.

The first one is the stickiest problem. If I never retransmitted any
data, sites which want to start a database would have to find someone
who was willing to let them have a copy of the database.  Where would
they find this information?  This means that I need to, at least,
periodically post a minimal database of sites that are archives for
the database.

Now, how do I best serve the needs of the guy who just has one thing
he is looking for? If I send the data just once, he is unlikely to
see it. The alternative is to send it periodically, with reasonably
long expiration dates, so that he can look on his system.

Anyway, for now, I will do the latter; if the volume gets too high,
then I'll look into some other method.

The second item means posting the information as soon as it comes in
and has been verified.  The main drawback to this is that sometimes
the information is incorrectly sent. Putting a delay in the system
results in much of this error being corrected before it gets out.  My
own feeling is to make updating the system reasonably painless, so
that if errors like this occur, they can be fixed reasonably easily.

The third item requires minimizing the information transmitted which
is used to update the database (a worthy goal of its own) and
minimizing the programming needed to maintain the database.  The
first suggests sending updates as increments: if a site adds or
deletes something, only that addition or deletion gets sent, not the
whole thing. In the interests of keeping the database simple, the
whole database should be maintained in ASCII and be maintainable with
standard UNIX tools.  Of course, it would be even better if the tools
needed to maintain this could be found through the database.

----

That leads to the problem of how to maintain the database.  First,
the subject line is used to indicate that this is a database update
message.  Such a subject line starts with the string 'DB:'. This should
make it reasonable to separate these entries from the others.  The
remainder of the subject line may be used for any additional comments
I might wish to add.

The body of the message contains the database update commands.

Commands to add data look like:

      @ADD <database>

and the following data is what is to be added.  <database> is one of
the strings INFO, SITE, or INDEX. The new data is terminated by a
blank line.

Commands to delete data look like:

      @DEL <database> <key>

The key depends on what is being deleted. Deletions from the
information database just use the item name. Deletions from the site
database use the site name. Deletions from the archive index use the
site name, the access method, and the access handle for the line to be
deleted.

There is a special command to delete all index entries for a site;
its form is:

      @DELALL INDEX <site>

All of this should be reasonably easy to do; I roughed out a shell
script using sed, join, and comm that would handle this, though it
would be SLOW. However, it would be reasonably easy to write a simple
program that would be MUCH faster.

---

Ok, guys, it's your turn.

---
Bill
{uunet|novavax}!proxftl!twwells!bill

grumpy@edg1 (Eric Schwarz) (10/27/88)

I've got a question and possible problem for you concerning the
database format.

Is there a reason why you are using different field delimiters for
the 3 database entry formats?  The site entry uses colons, the
information entry uses commas (with colon sub-field delimiters), and
the content entry uses semi-colons.

The site entry contains a path to the archive files, what about
archives that have multiple archive directories?  You need to know
which files are in which directories.  Is putting this information
in the content entry going to make it too big (I don't know how
many content entries there will be eventually)?

Apart from these two items, it looks pretty good.

Eric Schwarz
uunet!edg1!grumpy

bill@twwells.uucp (T. William Wells) (11/01/88)

In article <275@edg1.UUCP> grumpy@edg1 (Eric Schwarz) writes:
: Is there a reason why you are using different field delimiters for
: the 3 database entry formats?  The site entry uses colons, the
: information entry uses commas (with colon sub-field delimiters), and
: the content entry uses semi-colons.

No, other than carelessness. I am changing the format so that
semicolons are the field delimiter, commas are the smallest subfield
delimiter, and colons are the intermediate field delimiter.

: The site entry contains a path to the archive files, what about
: archives that have multiple archive directories?  You need to know
: which files are in which directories.  Is putting this information
: in the content entry going to make it too big (I don't know how
: many content entries there will be eventually)?

This is one reason why there can be more than one CO line. Suppose that
I had stuff in directories /archive/foo and /archive/bar; I could
then have two CO lines:

	CO foo.uucp;uucp;/archive/foo;...
	CO bar.uucp;uucp;/archive/bar;...

In the content database, things in /archive/foo would have lines like:

	prog;vers;mysite;foo*;foo-file;...

and things in /archive/bar would have lines like:

	prog;vers;mysite;bar*;bar-file;...

This also makes it easy to tell everyone that the path has changed:
all you do is resubmit the site entry.
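
The wild-card lookup described above could be sketched in Python
roughly as follows.  This assumes the access tag is the first
semicolon-separated field of the CO line, as in the two example lines,
and uses the standard fnmatch module for UNIX-style patterns.  The
function name matching_co_lines is hypothetical, and the trailing ...
in each CO line is the elision from the example, kept as-is.

```python
from fnmatch import fnmatchcase


def matching_co_lines(co_lines, access_pattern):
    """Return the CO lines whose access tag matches the index pattern."""
    matches = []
    for line in co_lines:
        # Drop the "CO" keyword; the tag is the first semicolon field.
        tag = line.split(None, 1)[1].split(";", 1)[0]
        if fnmatchcase(tag, access_pattern):
            matches.append(line)
    return matches


co_lines = [
    "CO foo.uucp;uucp;/archive/foo;...",
    "CO bar.uucp;uucp;/archive/bar;...",
]

print(matching_co_lines(co_lines, "foo*"))
print(len(matching_co_lines(co_lines, "*")))
```

A content line with foo* in its access field therefore resolves to
/archive/foo, while a plain * would match every CO line for the site.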

---
Bill
{uunet|novavax}!proxftl!twwells!bill

comparc@twwells.uucp (comp.archives) (11/11/88)

This is the second attempt at the database structure.  Changes are
still possible, so send in any comments you might have.

Here is a short summary of the changes from the previous version:

	Lines in the database beginning with # are ignored.

	The end of data in a DB: posting is signaled by a line
	containing @END.

	Everything in a DB: posting before the first line beginning
	with an @ is ignored.

	The time field in the CO line for ftp access has been changed.

	A TT line has been added to the site entry format; it
	contains a short title for the archive site.

	A TM line has been added to the site entry format; it
	specifies the best times to use the archives.

	A KW line has been added to the site entry format; it contains
	a list of keywords describing the archive.  (The original
	description said that the keywords are separated by
	semicolons.  This is an error; they are separated by commas.)

	An IX line has been added to the site entry format; it
	contains information about the index files for the archive.

	The contents lines have a new field, containing the size of
	the file in K.

	Some field delimiters have been changed. The CO line now uses
	semicolons instead of colons. The SY line now uses semicolons
	instead of commas.

---

Comments in the databases begin with a #. They are retained with the
data but are otherwise ignored.

In the line oriented databases, if there is a line that is to be left
blank, that line should still be entered, but with everything but the
keyword left blank.

---

The site database contains a series of entries separated by blank
lines.  Each entry has the following lines:

NM <the site name>
EN <who added the entry and when>
TM <best times to call the site>
TT <the name of the archive>
AD <who administers the site>
MA <the administrator's mailing address>
CO <information needed to set up communications with the site>
IX <where the index files are>
KW <keywords describing the archive>
DE <description of the site>

Lines from TT to DE may be repeated as a group as often as necessary
to describe different archives at a single site.  Each of the lines
from AD to DE may be repeated as often as necessary to contain the
data.
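
The general rules for the line-oriented databases (comments, blank
separator lines, keyword-only lines, repeated keywords) could be
handled by a reader along these lines.  This is only a sketch in
Python; read_entries and the sample text are hypothetical.  It skips
comment lines for simplicity, whereas a real maintenance tool would
have to keep them with the data.

```python
def read_entries(text):
    """Split a line-oriented database into entries.

    Each entry is a list of (keyword, value) pairs, so repeated
    keywords such as several AD or DE lines are preserved in order.
    """
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            # Comment lines are skipped in this sketch.
            continue
        if not line.strip():
            # A blank line ends the current entry.
            if current:
                entries.append(current)
                current = []
            continue
        keyword, _, value = line.partition(" ")
        # A keyword-only line (e.g. "VR") yields an empty value.
        current.append((keyword, value.strip()))
    if current:
        entries.append(current)
    return entries


sample = """\
# hypothetical sample data
NM free-distribution-database
VR
TT Database of freely distributable information.

NM unix-pcomm
VR version 1.1
"""

for entry in read_entries(sample):
    print(entry)
```

Keeping each entry as an ordered list rather than a dictionary avoids
losing repeated lines.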

Following is a detailed description of each line.

NM <the site name>

   This is a domain name. If you are a uucp site, you should write
   this as <site>.uucp.

EN <user>@<site> (<name>) <date>

   This identifies the person who entered the database entry.  The
   <date> is the output from the date command.

TM <time zone>;[[<day>],...<from>-<to> <load>];...

   This lets people know when the best times are to use the archive.
   The first field is the time zone the archive is contained in; all
   times in the entry are presumed to be relative to that time zone.
   <Day> is a three letter day abbreviation.  The <from> and <to> are
   times in 24 hr notation. <Load> is a single word describing the
   load on your system at these times; the suggested words are: none,
   light, moderate, heavy, swamped.

TT <the name of the archive>

   A short title for the archive.

AD <user>@<site> (<name>)

   The person who administers the archive.  If more than one person
   administers the archive, there should be more than one of these.

MA <the administrator's mailing address>

   The mailing address for help or information.  Leave this blank
   unless you want snail-mail.  People who mail to this address had
   better include a SASE or e-mail address or forget about getting
   any response.

CO <access-tag>;ftp;<name>;<internet address>;<directory>;<when available>
CO <access-tag>;uucp;<directory>;<L.sys entry>

   This line describes each method of getting at the archive. If there
   is more than one way to get at the archive, or more than one
   directory containing archive information, then there will be more
   than one of these lines.

   The <access-tag> is used when not all of your archived information
   is available through all paths to your site.  For example, you
   might have a mail based server for small items but require a
   direct link for larger things.  Each item that you list as
   available through your archive has a tag that is used to indicate
   which way it can be accessed.

   There may be more than one line for a single tag. This would mean
   that there is more than one way to get to the same set of
   information.

   The next field describes the access method.  Right now, it is
   either uucp or ftp; more will be added as needed.

   The remaining fields depend on the access method.

   There are two fields for uucp access.  The first is the path name
   which archive file names are relative to. The second is an L.sys
   entry that would be used to access your site.

   For ftp, the fields are the domain name for accessing the archive
   (which is normally the same as the site name), the internet
   address for the above, the directory where the archive information
   resides, and the times when the archive is available.

   If the archive is always available, leave that field blank. Otherwise,
   format as [[<day>],...<from>-<to>];...

IX <access-tag>;<handle>;<size>;<date>;<tools needed to unarchive>;<comments>

   This line describes the index file(s) for the archive. It is the
   same format as the entries in the index database, except that the
   first three fields are not present.

KW <keyword>,...

   This is a list of keywords that describe what the site carries.

DE <description of the site>

   This is a few lines that describe the site.  This should be kept
   reasonably short, but should give any information not specified in
   the previous lines that might be useful to the archive user.

---

The archived information database contains a series of entries
separated by blank lines.  Each entry has the following lines:

NM <name of the item>
VR <a version number>
AU <the author of the item>
MA <the maintainer of the item>
EN <who entered this into the database>
TT <a title for the item>
KW <keywords for the item>
SY <hardware and software needed for it, and how hard it is to bring it up>
DE <a short description of the item>

Following is a detailed description of each line.

NM <name>

      The name of the item.  If the item is a program that ports to
      one environment, the name is that environment hyphenated with
      the program name; otherwise it is just the name.  Note that
      this is not intended to be useful by itself, e.g., unix-pcomm
      might eventually also refer to something that has been made to
      work under VMS.  Should there be two items with the same name,
      the later item will have its author's name appended.  For
      example, should John Turkey later write a pcomm for UNIX, it
      would be called unix-pcomm-turkey.

VR version <version>
VR date <date>

     These tell which version this entry refers to. The first form is
     used for things with named versions, the second is used for
     something which is regularly updated. The date, for the second
     format, is yymmdd, and specifies the date the thing was last
     updated. Some things are so continuously updated that they
     should not have a version; for them, leave this line blank.
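
A small Python sketch of how the three VR forms might be told apart.
The function name parse_vr and the tuple it returns are hypothetical;
the yymmdd date is read with the standard datetime module.

```python
from datetime import datetime


def parse_vr(line):
    """Classify a VR line as ('version', str), ('date', date) or (None, None)."""
    parts = line.split(None, 2)
    if not parts or parts[0] != "VR":
        raise ValueError("not a VR line: %r" % line)
    if len(parts) == 1:
        # Keyword only: the item is updated too often to carry a version.
        return (None, None)
    kind = parts[1]
    value = parts[2] if len(parts) > 2 else ""
    if kind == "version":
        return ("version", value)
    if kind == "date":
        # %y maps two-digit years 69-99 to 1969-1999.
        return ("date", datetime.strptime(value, "%y%m%d").date())
    raise ValueError("unknown VR form: %r" % line)


print(parse_vr("VR version 1.1"))
print(parse_vr("VR date 881111"))
print(parse_vr("VR"))
```

Note that a two-digit year leaves the century to the reader's tool;
the mapping used here is Python's convention, not something the format
specifies.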

AU <user>@<site> (<name>)

      This is the person or persons who wrote the thing.  If there
      is more than one author, use more than one line.

MA <user>@<site> (<name>)

      This is who is maintaining the item.  If the item is not being
      maintained, leave this blank.  If several people are
      maintaining it, use several lines.  Note that anyone whose name
      is on one of these lines can expect e-mail about the item.

EN <user>@<site> (<name>) <date>

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

TT <title>

      A title for the item.

KW <keyword>,...

      Keywords describing the item.  Some good kinds of keywords:
      `all-source', which means that all the source (other than that
      of the tools mentioned below) needed is included;
      `public-domain', which indicates that the item is in the public
      domain.

SY <hardware>[:<hw add-ons>,...];<software>[:<sw add-ons>,...];
     <effort-needed>;<tools-needed>

      For each system this item runs on (or must be used on), there
      should be one of these lines.  The fields are:

      1) The hardware it runs on.  If it runs on any hardware which a
	 particular OS runs on, the entry is `any'.  If the item
	 needs hardware other than the standard for the system, add
	 words for it after a colon.

      2) The OS it runs under.  There are several generic names like
	 `unix' or `sysv-unix'.  Optional OS things which are needed
	 are indicated the same way hardware options are.  Also,
	 software which is not listed in this database which is
	 needed to make this item go is listed here.  For example,
	 were this item to be a Dbase program, this field would be:
	 MS-DOS:Dbase-II.

      3) How much effort is needed to make it go.  If following the
	 directions is sufficient, the entry is `install'.

      4) This entry contains any tools, not normally available on
	 your system, which one must have in order to build or use
	 this item. All items which are in this section must also
	 have their own entries in the information directory.
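
The four SY fields could be pulled apart as in the following Python
sketch.  The names split_options and parse_sy are hypothetical.  Two
things are assumptions rather than part of the description: that the
tools field is comma-separated like the add-on lists, and that the
whole SY entry sits on one line in the actual database.  The first
sample line is the earlier Pcomm example rewritten by hand with the new
delimiters.

```python
def split_options(field):
    """Split '<base>[:<add-on>,...]' into (base, [add-ons])."""
    base, _, extras = field.partition(":")
    return base, [e for e in extras.split(",") if e]


def parse_sy(line):
    """Parse 'SY <hw>;<sw>;<effort>;<tools>' into a dictionary."""
    keyword, _, rest = line.partition(" ")
    if keyword != "SY":
        raise ValueError("not an SY line: %r" % line)
    fields = rest.split(";")
    if len(fields) != 4:
        raise ValueError("expected 4 fields: %r" % line)
    hardware, hw_addons = split_options(fields[0])
    software, sw_addons = split_options(fields[1])
    return {
        "hardware": hardware,
        "hw_addons": hw_addons,
        "software": software,
        "sw_addons": sw_addons,
        "effort": fields[2],
        # Assumption: tools are comma-separated, like the add-on lists.
        "tools": [t for t in fields[3].split(",") if t],
    }


print(parse_sy("SY any:modem;sysv-unix:termcaps;install;"))
print(parse_sy("SY any;any;;"))
```

Empty trailing fields still count, so a line such as SY any;any;;
parses into four fields with a blank effort and no tools.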

DE <some text>

      This is a short description of the item. This should be kept
      brief; putting the man page here is not appropriate.

Here is an entry, suitable for the databases created through
comp.archives.

NM free-distribution-database
VR
AU bill@twwells.UUCP (T. William Wells)
MA bill@twwells.UUCP (T. William Wells)
EN bill@twwells.UUCP (T. William Wells) Fri Nov 11 00:56:16 EST 1988
TT Database of freely distributable, electronically accessible information.
KW database,public-domain
SY any;any;;
DE This database is constructed from the information that passes
DE through comp.archives.  It contains information on any software,
DE databases, documents, or what-have-you, that is both freely
DE distributable and available electronically.  "Freely
DE distributable" means that, if you have a copy of the item, you
DE can (at least) make exact copies and give them away, and you
DE don't have to tell the owner of the item (if any) that you have
DE done so.  "Electronically available" means that it is either
DE accessible through a publicly accessible network, or is available
DE by a means that does not involve paying a fee to the
DE distributor.  This information is provided as a free service and
DE there is *no one* guaranteeing that any of it is accurate or
DE useful.  Use it at your own risk.

---

The site index ties the previous two databases together.
This is the format:

<name>;<version>;<archive>;<access-tag>;<handle>;<size>;
    <date>;<tools>;<comments>

	The first two fields link this entry to an entry in the info
	database; they correspond to the NM and VR fields.  If this
	file is not listed in the database, these fields are blank.

	`Archive' is the name of the site, as recorded in the site
	database.

	`Access-tag' is one of the access tags specified in the site
	entry.  Note that this is in the style of UNIX file names:
	wild cards are permitted.

	`Handle' is used with the information from the site entry to
	construct the request from the archive.  For example, using
	uucp, if the site entry contained /usr/archives as the path
	to which file names are relative, and this field contains
	foobar.shar, then the path name you should use to get this
	item is /usr/archives/foobar.shar.

	`Size' is the size of the file in K.

	`Date' is the date on which this entry was added to the
	database.  This should be yymmdd.

	`Tools' is a list of programs needed to unarchive the file;
	each must be a name in the info database.  Standard system
	utilities are not listed.

	`Comments' is anything useful to add.

---

The DB: postings contain information to update the database.  The
update information starts with the first line beginning with an @ and
ends with a line containing @END. Additional information, not
intended to be part of the database can be added before the first @
line or after the @END line.

Commands to add data look like:

      @ADD <database>

and the following data is what is to be added.  <database> is one of
the strings INFO, SITE, or INDEX. The new data is terminated by a
blank line. This blank line is required, no matter what the next
command is.

Commands to delete data look like:

      @DEL <database> <key>

The key depends on what is being deleted. Deletions from the
information database just use the item name. Deletions from the site
database use the site name. Deletions from the archive index use the
site name, the access method, and the access handle for the line to be
deleted.

There is a special command to delete all index entries for a site;
its form is:

      @DELALL INDEX <site>
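
To make the rules above concrete, here is a Python sketch that pulls
the update commands out of the body of a DB: posting.  The function
name parse_db_posting, the returned tuples, and the sample posting are
all hypothetical.  One assumption goes beyond the description: stray
text between commands is ignored in the same way as text before the
first @ line.

```python
def parse_db_posting(body):
    """Return a list of (command, args, data_lines) from a DB: posting."""
    commands = []
    lines = iter(body.splitlines())
    for line in lines:
        if not line.startswith("@"):
            # Text before the first @ line is ignored; so, by assumption,
            # is any stray text between commands.
            continue
        words = line.split()
        command, args = words[0], words[1:]
        if command == "@END":
            # Everything after @END is not part of the update.
            break
        data = []
        if command == "@ADD":
            # The added data runs up to the required blank line.
            for data_line in lines:
                if not data_line.strip():
                    break
                data.append(data_line)
        commands.append((command, args, data))
    return commands


posting = """\
Some remarks from the moderator.

@ADD INFO
NM unix-pcomm
VR version 1.1

@DEL INFO unix-pcomm
@DELALL INDEX twwells
@END
Trailing text, also ignored.
"""

for command in parse_db_posting(posting):
    print(command)
```

Applying the commands to the three databases is a separate step and is
left out here.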

---
Bill
{uunet|novavax}!proxftl!twwells!bill

send comp.archives postings to twwells!comp-archives
send comp.archives related mail to twwells!comp-archives-request

comparc@twwells.uucp (comp.archives) (12/01/88)

This is the third attempt at the database structure.  Changes are
still possible, so send in any comments you might have.

Here is a short summary of the changes from the previous version:

    1) The definition of a site is somewhat vague. What I am going to
       do is to consider one set of archives under the control of a
       single administrator as an archive site.  This means that the
       site entry won't have different sets of data for archives
       located at the same site.  This also means that the archive
       name will be somewhat less related to the address of the
       archive.

    2) The access method and access tag of the CO fields have been
       swapped. The access method now comes first.

---

Comments in the databases begin with a #. They are retained with the
data but are otherwise ignored.

In the line oriented databases, if there is a line that is to be left
blank, that line should still be entered, but with everything but the
keyword left blank.

---

The site database contains a series of entries separated by blank
lines.  Each entry has the following lines:

NM <the site name>
EN <who added the entry and when>
TM <best times to call the site>
TT <the name of the archive>
AD <who administers the site>
MA <the administrator's mailing address>
CO <information needed to set up communications with the site>
IX <where the index files are>
KW <keywords describing the archive>
DE <description of the site>

Each of the lines from AD to DE may be repeated as often as necessary
to contain the data.
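
As a rough illustration of the layout just described, here is a minimal
Python sketch of a reader for the line-oriented databases.  It assumes
that the keyword is separated from its data by a single space and that
a keyword standing alone means a blank field.  The function names are
made up for this example, and comments are simply skipped here rather
than retained.

```python
def parse_entries(text):
    """Split a line-oriented database into entries.

    Each entry is a list of (keyword, value) pairs in file order, so
    repeated keywords such as AD or DE keep their sequence.
    """
    entries, current = [], []
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("#"):
            continue  # comment line: skipped in this sketch
        if not line:
            # A blank line separates entries.
            if current:
                entries.append(current)
                current = []
            continue
        keyword, _, value = line.partition(" ")
        current.append((keyword, value))
    if current:
        entries.append(current)
    return entries


def field(entry, keyword):
    """Return every value recorded for a keyword, in order."""
    return [value for key, value in entry if key == keyword]
```

A line holding only a keyword, such as a bare MA, comes back with an
empty string as its value.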

Following is a detailed description of each line.

NM <the site name>

   This name should be related to the address used to find the site,
   though it doesn't have to be.  This should be kept fairly short.

EN <user>@<site> (<name>) <date>

   This says who the person is who entered the database entry.  The
   <date> is the output from the date command.

TM <time zone>;[[<day>],...<from>-<to> <load>];...

   This lets people know when the best times are to use the archive.
   The first field is the time zone the archive is contained in; all
   times in the site entry are presumed to be relative to that time
   zone.  <Day> is a three letter day abbreviation.  The <from> and
   <to> are times in 24 hr notation. <Load> is a single word
   describing the load on your system at these times; the suggested
   words are: none, light, moderate, heavy, swamped, best, worst.
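
The TM layout leaves some details open, so the following Python sketch
is only one possible reading.  It assumes that times are four digits
(hhmm), that the day list is optional and comma separated, and that
white space between the day list and the times is optional.  The
sample line in the usage note is invented.

```python
import re

# One "<days><from>-<to> <load>" group; the day list may be absent.
SLOT = re.compile(
    r"^(?P<days>[A-Za-z]{3}(?:,[A-Za-z]{3})*)?\s*"
    r"(?P<start>\d{4})-(?P<end>\d{4})\s+(?P<load>\w+)$"
)


def parse_tm(value):
    """Return (time_zone, [(days, start, end, load), ...])."""
    zone, *groups = value.split(";")
    slots = []
    for group in groups:
        group = group.strip()
        if not group:
            continue  # empty field, e.g. after a trailing semicolon
        match = SLOT.match(group)
        if match is None:
            raise ValueError("unrecognized TM group: %r" % group)
        days = match.group("days").split(",") if match.group("days") else []
        slots.append((days, match.group("start"),
                      match.group("end"), match.group("load")))
    return zone, slots
```

For instance, parse_tm("EST;Mon,Tue 1800-0800 light;0900-1700 heavy")
yields the zone EST and two slots, the second applying to no
particular day.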

TT <the name of the archive>

   A short title for the archive.

AD <user>@<site> (<name>)

   The person who administers the archive.  If more than one person
   administers the archive, there should be more than one of these.

MA <the administrator's mailing address>

   The mailing address for help or information.  Leave this blank
   unless you want snail-mail.  People who mail to this address had
   better include a SASE or e-mail address or forget about getting
   any response.

CO ftp;<access-tag>;<name>;<internet address>;<directory>;<when available>
CO uucp;<access-tag>;<directory>;<L.sys entry>

   This line describes each method of getting at the archive. If there
   is more than one way to get at the archive, or more than one
   directory containing archive information, then there will be more
   than one of these lines.

   The <access-tag> is used when not all of your archived information
   is available through all paths to your site.  For example, you
   might have a mail based server for small items but require a
   direct link for larger things.  Each item that you list as
   available through your archive has a tag that is used to indicate
   which way it can be accessed.

   There may be more than one line for a single tag. This would mean
   that there is more than one way to get to the same set of
   information.

   The first field on the line describes the access method.  Right
   now, it is either uucp or ftp; more will be added as needed.

   The remaining fields depend on the access method.

   There are two fields for uucp access.  The first is the path name
   which archive file names are relative to. The second is an L.sys
   entry that would be used to access your site.

   For ftp, the fields are the domain name for accessing the archive,
   the internet address for the above, the directory where the
   archive information resides, and the times when the archive is
   available.

   If the archive is always available, leave that field blank.
   Otherwise, format as [[<day>],...<from>-<to>];...

IX <access-tag>;<handle>;<size>;<date>;<tools needed to unarchive>;<comments>

   This line describes the index file(s) for the archive. It is the
   same format as the entries in the index database, except that the
   first three fields are not present. You should also list README
   files and the like.

KW <keyword>,...

   This is a list of keywords that describe what the site carries.

DE <description of the site>

   This is a few lines that describe the site.  This should be kept
   reasonably short, but should give any information not specified in
   the previous lines that might be useful to the archive user.

---

The archived information database contains a series of entries
separated by blank lines.  Each entry has the following lines:

NM <name of the item>
VR <a version number>
AU <the author of the item>
MA <the maintainer of the item>
EN <who entered this into the database>
TT <a title for the item>
KW <keywords for the item>
SY <hardware and software needed for it, and how hard it is to bring it up>
DE <a short description of the item>

Following is a detailed description of each line.

NM <name>

      The name of the item.  If the item is a program that runs in
      one environment, the name is that environment hyphenated with
      the program name; otherwise it is just the name.  Note that
      this is not intended to be useful by itself, e.g., unix-pcomm
      might eventually also refer to something that has been made to
      work under VMS.  Should there be two items with the same name,
      the later item will have its author's name appended.  For
      example, should John Turkey later write a pcomm for UNIX, it
      would be called unix-pcomm-turkey.

VR version <version>
VR date <date>

     These tell which version this entry refers to. The first form is
     used for things with named versions, the second is used for
     something which is regularly updated. The date, for the second
     format, is yymmdd, and specifies the date the thing was last
     updated. Some things are so continuously updated that they
     should not have a version; for them, leave this line blank.
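
The three cases for the VR line can be told apart mechanically.  Here
is a minimal Python sketch, assuming the data after the keyword has
already been isolated.  The function name and the returned tuples are
made up for this example.  Note that Python's %y maps two digit years
69-99 to 1969-1999, which suits dates from this era.

```python
from datetime import date, datetime


def parse_vr(value):
    """Classify the data part of a VR line.

    Returns ("version", text), ("date", a datetime.date), or
    ("none", None) for a blank VR line.
    """
    value = value.strip()
    if not value:
        return ("none", None)
    kind, _, rest = value.partition(" ")
    rest = rest.strip()
    if kind == "version":
        return ("version", rest)
    if kind == "date":
        # yymmdd, as described above
        return ("date", datetime.strptime(rest, "%y%m%d").date())
    raise ValueError("unrecognized VR line: %r" % value)
```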

AU <user>@<site> (<name>)

      This is the person or persons who wrote the thing.  If there
      is more than one author, use more than one line.

MA <user>@<site> (<name>)

      This is who is maintaining the item.  If the item is not being
      maintained, leave this blank.  If several people are
      maintaining it, use several lines.  Note that anyone whose name
      is on one of these lines can expect e-mail about the item.

EN <user>@<site> (<name>) <date>

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

TT <title>

      A title for the item.

KW <keyword>,...

      Keywords describing the item.  Some good kinds of keywords:
      `all-source', which means that all the source (other than that
      of the tools mentioned below) needed is included;
      `public-domain', which indicates that the item is in the public
      domain.

SY <hardware>[:<hw add-ons>,...];<software>[:<sw add-ons>,...];
     <effort-needed>;<tools-needed>

      For each system this item runs on (or must be used on), there
      should be one of these lines.  The fields are:

      1) The hardware it runs on.  If it runs on any hardware which a
	 particular OS runs on, the entry is `any'.  If the item
	 needs hardware other than the standard for the system, add
	 words for it after a colon.

      2) The OS it runs under.  There are several generic names like
	 `unix' or `sysv-unix'.  Optional OS things which are needed
	 are indicated the same way hardware options are.  Also,
	 software which is not listed in this database which is
	 needed to make this item go is listed here.  For example,
	 were this item to be a Dbase program, this field would be:
	 MS-DOS:Dbase-II.

      3) How much effort is needed to make it go.  If following the
	 directions is sufficient, the entry is `install'.

      4) This entry contains any tools, not normally available on
	 your system, which one must have in order to build or use
	 this item. All items which are in this section must also
	 have their own entries in the information directory.
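
The four SY fields can be pulled apart as in the following Python
sketch.  It assumes that the SY line has already been joined onto one
line, that all three semicolons are present, and that the tools list
is comma separated like the add-on lists; the last point is a guess.
The dictionary keys are made up for this example.

```python
def split_addons(text):
    """Split "base:addon,addon" into (base, [addon, ...])."""
    base, _, addons = text.partition(":")
    return base, [item for item in addons.split(",") if item]


def parse_sy(value):
    """Parse the data part of an SY line into a dictionary."""
    parts = value.split(";")
    if len(parts) != 4:
        raise ValueError("expected 4 fields, got %d" % len(parts))
    hardware, hw_addons = split_addons(parts[0])
    software, sw_addons = split_addons(parts[1])
    return {
        "hardware": hardware,
        "hw_addons": hw_addons,
        "software": software,
        "sw_addons": sw_addons,
        "effort": parts[2],
        # Assumption: tools are separated by commas.
        "tools": [tool for tool in parts[3].split(",") if tool],
    }
```

With the Dbase example above, parse_sy("any;MS-DOS:Dbase-II;install;")
gives MS-DOS as the software and Dbase-II as its only add-on.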

DE <some text>

      This is a short description of the item. This should be kept
      brief; putting the man page here is not appropriate.

Here is an entry, suitable for the databases created through
comp.archives.

NM free-distribution-database
VR
AU bill@twwells.UUCP (T. William Wells)
MA bill@twwells.UUCP (T. William Wells)
EN bill@twwells.UUCP (T. William Wells) Fri Nov 11 00:56:16 EST 1988
TT Database of freely distributable, electronically accessible information.
KW database,public-domain
SY any;any;;
DE This database is constructed from the information that passes
DE through comp.archives.  It contains information on any software,
DE databases, documents, or what-have-you, that is both freely
DE distributable and available electronically.  "Freely
DE distributable" means that, if you have a copy of the item, you
DE can (at least) make exact copies and give them away, and you
DE don't have to tell the owner of the item (if any) that you have
DE done so.  "Electronically available" means that it is either
DE accessible through a publicly accessible network, or is available
DE by a means that does not involve paying a fee to the
DE distributor.  This information is provided as a free service and
DE there is *no one* guaranteeing that any of it is accurate or
DE useful.  Use it at your own risk.

---

The site index ties the previous two databases together.  This is the
format:

<name>;<version>;<archive>;<access-tag>;<handle>;<size>;
    <date>;<tools>;<comments>

	The first two fields link this entry to an entry in the info
	database; they correspond to the NM and VR fields.  If this
	file is not listed in the database, these fields are blank.

	`Archive' is the name of the site, as recorded in the site
	database.

	`Access-tag' is one of the access tags specified in the site
	entry.  Note that this is in the style of UNIX file names:
	wild cards are permitted.

	`handle' is used with the information from the site entry to
	construct the request from the archive.  For example, using
	uucp, if the site entry contained /usr/archives as the path
	to which file names are relative, and this field contains
	foobar.shar, then the path name you should use to get this
	item is /usr/archives/foobar.shar.

	`Date' is the date on which this entry was added to the database.
	This should be yymmdd.

	`Tools' is a list of programs needed to unarchive the file;
	each must be a name in the info database.  Standard system
	utilities are not listed.

	`Comments' is anything useful to add.
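
To make the index format concrete, here is a small Python sketch that
splits an index line into its nine fields and builds the uucp path
name from the /usr/archives example above.  It assumes the line has
been joined onto one line with all eight semicolons present.  The
field names and the sample values in the usage note are invented.

```python
from collections import namedtuple
import posixpath

IndexLine = namedtuple(
    "IndexLine",
    "name version archive access_tag handle size date tools comments",
)


def parse_index_line(line):
    """Split one index line into its nine fields."""
    parts = line.split(";", 8)  # any further semicolons stay in comments
    if len(parts) != 9:
        raise ValueError("expected 9 fields, got %d" % len(parts))
    return IndexLine(*parts)


def uucp_path(directory, handle):
    """Join the site entry's directory with the handle from the index."""
    return posixpath.join(directory, handle)
```

Given a line such as
"unix-pcomm;1.1;twwells.UUCP;uucp;foobar.shar;120;881201;;", the
handle is foobar.shar and uucp_path("/usr/archives", "foobar.shar")
gives /usr/archives/foobar.shar.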

---

The DB: postings contain information to update the database.  The
update information starts with the first line beginning with an @ and
ends with a line containing @END. Additional information, not
intended to be part of the database, can be added before the first @
line or after the @END line.

Commands to add data look like:

      @ADD <database>

and the following data is what is to be added.  <database> is one of
the strings INFO, SITE, or INDEX. The new data is terminated by a
blank line. This blank line is required, no matter what the next
command is.

Commands to delete data look like:

      @DEL <database> <key>

The key depends on what is being deleted. Deletions from the
information database just use the item name. Deletions from the site
database use the site name. Deletions from the archive index use the
site name, the access method, and the access handle for the line to be
deleted.

There is a special command to delete all index entries for a site;
its form is:

      @DELALL INDEX <site>

---
Bill
{uunet|novavax}!proxftl!twwells!bill

send comp.archives postings to twwells!comp-archives
send comp.archives related mail to twwells!comp-archives-request

comparc@twwells.uucp (comp.archives) (01/03/89)

Here is a short summary of the changes from the previous version:

    1) Two new access methods have been added, one for fidonet and
       one for BBS's.

    2) All file sizes should be in K; this was not stated in the
       previous version.

    3) Text on the DE lines should be kept to less than 70
       characters; this makes life easier for pretty-printing the
       archive information.

    4) Lines that have fields separated by semicolons should have all
       the semicolons on the line, including trailing ones.  This was
       not specified in the previous version.

    5) The key separator on @DEL lines is a semicolon. This was not
       specified in the previous version.

---

Comments in the databases begin with a #. They are retained with the
data but are otherwise ignored.

In the line oriented databases, if there is a line that is to be left
blank, that line should still be entered, but with everything but the
keyword left blank.

Lines that have fields separated by semicolons should have all
semicolons on the line, including trailing ones.
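
A short Python sketch of what this rule means when writing a line: the
fields are padded out to the full count so that every separator
appears, even when the last fields are empty.  The function name is
made up, and it takes the rule to mean separators only, with no extra
terminating semicolon, which matches the SY any;any;; line in the
sample entry.

```python
def join_fields(values, count):
    """Join values with semicolons, padding with empty fields to count."""
    values = list(values)
    if len(values) > count:
        raise ValueError("too many fields")
    if any(";" in value for value in values):
        raise ValueError("a field may not contain a semicolon")
    return ";".join(values + [""] * (count - len(values)))
```

For a four field line with only the first two fields filled in,
join_fields(["any", "any"], 4) produces any;any;;.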

---

The site database contains a series of entries separated by blank
lines.  Each entry has the following lines:

NM <the site name>
EN <who added the entry and when>
TM <best times to call the site>
TT <the name of the archive>
AD <who administers the site>
MA <the administrator's mailing address>
CO <information needed to set up communications with the site>
IX <where the index files are>
KW <keywords describing the archive>
DE <description of the site>

Each of the lines from AD to DE may be repeated as often as necessary
to contain the data.

Following is a detailed description of each line.

NM <the site name>

   This name should be related to the address used to find the site,
   though it doesn't have to be.  This should be kept fairly short.

EN <user>@<site> (<name>) <date>

   This says who the person is who entered the database entry.  The
   <date> is the output from the date command.

TM <time zone>;[[<day>],...<from>-<to> <load>];...

   This lets people know when the best times are to use the archive.
   The first field is the time zone the archive is contained in; all
   times in the site entry are presumed to be relative to that time
   zone.  <Day> is a three letter day abbreviation.  The <from> and
   <to> are times in 24 hour notation. <Load> is a single word
   describing the load on your system at these times; the suggested
   words are: none, light, moderate, heavy, swamped, best, worst.

TT <the name of the archive>

   A short title for the archive.

AD <user>@<site> (<name>)

   The person who administers the archive.  If more than one person
   administers the archive, there should be more than one of these.

MA <the administrator's mailing address>

   The mailing address for help or information.  Leave this blank
   unless you want snail-mail.  People who mail to this address had
   better include a SASE or e-mail address or forget about getting
   any response.

CO ftp;<access tag>;<name>;<internet address>;<directory>;<when available>
CO uucp;<access tag>;<directory>;<L.sys entry>
CO fido;<access tag>;<access-info>
CO bbs;<access tag>;<phone>;<when available>;<modem settings>;
	<protocols supported>;<comments>

   This line describes each method of getting at the archive. If there
   is more than one way to get at the archive, or more than one
   directory containing archive information, then there will be more
   than one of these lines.

   The <access tag> is used when not all of your archived information
   is available through all paths to your site.  Suppose that you had
   two archives, one of small programs that you had a mail-based
   server for, and another of larger stuff that you want to transfer
   only through uucp.  Your CO line for mail access could have an
   access tag of `mail' and your CO line for uucp access could have
   an access tag of `uucp'.

   Files which are available only through mail would have an access
   tag of `mail'. Files available only through uucp would have an
   access tag of `uucp'.  Files that were available either way would
   have an access tag of `*'.
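
The matching of a file's access tag against the tags on the CO lines
can be sketched in Python with the standard fnmatch module, which does
UNIX file name style matching.  The function name is made up, and
treating the file's tag as the pattern and the CO tags as the names to
match is my reading of the text.

```python
from fnmatch import fnmatchcase


def usable_co_tags(file_tag, co_tags):
    """Return the CO access tags through which a file can be fetched.

    file_tag is the (possibly wild card) tag recorded for the file;
    co_tags are the access tags found on the site's CO lines.
    """
    return [tag for tag in co_tags if fnmatchcase(tag, file_tag)]
```

With CO tags of mail and uucp, a file tagged `*' matches both, while a
file tagged mail matches only the mail line.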

   There may be more than one line for a single tag. This would mean
   that there is more than one way to get to the same set of
   information.

   The first field on the line describes the access method.  Right
   now, it is one of uucp, ftp, fido, or bbs; more will be added as
   needed.

   The remaining fields depend on the access method.

   There are two fields for uucp access.  The first is the path name
   which archive file names are relative to. The second is an L.sys
   entry that would be used to access your site.

   For ftp, the fields are the domain name for accessing the archive,
   the internet address for the above, the directory where the
   archive information resides, and the times when the archive is
   available.

   If the archive is always available, leave that field blank.
   Otherwise, format as [[<day>],...<from>-<to>];...

   There is one field for fidonet. This is some information needed
   for accessing the archive; as yet I have no idea what this info is.

   There are five fields for BBS access. The first is the phone
   number; if you want them, use hyphens for digit separators.  The
   second field indicates when the BBS is available; leave it blank
   if it is always available.  The modem settings are a comma
   separated list of entries like: <data bits><parity><stop
   bits>:<speed>. Parity is represented by one of the letters: (N)o,
   (E)ven, (O)dd, (M)ark, (S)pace. The protocols supported field
   indicates which protocols are available for file transfer.  The
   final field is for additional comments about getting into the BBS.
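
Here is a Python sketch of how the four CO forms and the BBS modem
settings might be taken apart.  It assumes a continued CO line has
already been joined onto one line.  The field names are labels made up
for this example; only their order comes from the descriptions above.
The BBS phone number in the usage note is fictitious.

```python
import re

# Fields that follow the access method, for each CO form.
CO_FIELDS = {
    "ftp": ["tag", "name", "address", "directory", "available"],
    "uucp": ["tag", "directory", "lsys"],
    "fido": ["tag", "access_info"],
    "bbs": ["tag", "phone", "available", "modem", "protocols", "comments"],
}

# <data bits><parity><stop bits>:<speed>, e.g. 8N1:2400
MODEM = re.compile(r"^(\d)([NEOMS])(\d(?:\.\d)?):(\d+)$")


def parse_co(value):
    """Parse the data part of a CO line into a dictionary."""
    method, _, rest = value.partition(";")
    if method not in CO_FIELDS:
        raise ValueError("unknown access method: %r" % method)
    names = CO_FIELDS[method]
    # Split no further than the last field, so it keeps any semicolons.
    parts = rest.split(";", len(names) - 1)
    parts += [""] * (len(names) - len(parts))
    record = dict(zip(names, parts))
    record["method"] = method
    return record


def parse_modem(settings):
    """Parse a comma separated list of modem settings."""
    result = []
    for item in settings.split(","):
        match = MODEM.match(item.strip().upper())
        if match is None:
            raise ValueError("unrecognized modem setting: %r" % item)
        data_bits, parity, stop_bits, speed = match.groups()
        result.append((int(data_bits), parity, stop_bits, int(speed)))
    return result
```

For example, parsing
"bbs;dialup;305-555-0100;;8N1:2400,7E1:1200;xmodem,kermit;" gives an
empty availability field, meaning always available, and two modem
settings.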

IX <access tag>;<handle>;<size>;<date>;<tools needed to unarchive>;<comments>

   This line describes the index file(s) for the archive. It is the
   same format as the entries in the index database, except that the
   first three fields are not present. You should also list README
   files and the like. Note that the file size should be in K's.

KW <keyword>,...

   This is a list of keywords that describe what the site carries.

DE <description of the site>

   This is a few lines that describe the site.  This should be kept
   reasonably short, but should give any information not specified in
   the previous lines that might be useful to the archive user.  The
   text on these lines should be kept to less than 70 characters.

---

The archived information database contains a series of entries
separated by blank lines.  Each entry has the following lines:

NM <name of the item>
VR <a version number>
AU <the author of the item>
MA <the maintainer of the item>
EN <who entered this into the database>
TT <a title for the item>
KW <keywords for the item>
SY <hardware and software needed for it, and how hard it is to bring it up>
DE <a short description of the item>

Following is a detailed description of each line.

NM <name>

      The name of the item.  If the item is a program that runs in
      one environment, the name is that environment hyphenated with
      the program name; otherwise it is just the name.  Note that
      this is not intended to be useful by itself, e.g., unix-pcomm
      might eventually also refer to something that has been made to
      work under VMS.  Should there be two items with the same name,
      the later item will have its author's name appended.  For
      example, should John Turkey later write a pcomm for UNIX, it
      would be called unix-pcomm-turkey.

VR version <version>
VR date <date>

     These tell which version this entry refers to. The first form is
     used for things with named versions, the second is used for
     something which is regularly updated. The date, for the second
     format, is yymmdd, and specifies the date the thing was last
     updated. Some things are so continuously updated that they
     should not have a version; for them, leave this line blank.

AU <user>@<site> (<name>)

      This is the person or persons who wrote the thing.  If there
      is more than one author, use more than one line.

MA <user>@<site> (<name>)

      This is who is maintaining the item.  If the item is not being
      maintained, leave this blank.  If several people are
      maintaining it, use several lines.  Note that anyone whose name
      is on one of these lines can expect e-mail about the item.

EN <user>@<site> (<name>) <date>

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

TT <title>

      A title for the item.

KW <keyword>,...

      Keywords describing the item.  Some good kinds of keywords:
      `all-source', which means that all the source (other than that
      of the tools mentioned below) needed is included;
      `public-domain', which indicates that the item is in the public
      domain.

SY <hardware>[:<hw add-ons>,...];<software>[:<sw add-ons>,...];
     <effort-needed>;<tools-needed>

      For each system this item runs on (or must be used on), there
      should be one of these lines.  The fields are:

      1) The hardware it runs on.  If it runs on any hardware which a
	 particular OS runs on, the entry is `any'.  If the item
	 needs hardware other than the standard for the system, add
	 words for it after a colon.

      2) The OS it runs under.  There are several generic names like
	 `unix' or `sysv-unix'.  Optional OS things which are needed
	 are indicated the same way hardware options are.  Also,
	 software which is not listed in this database which is
	 needed to make this item go is listed here.  For example,
	 were this item to be a Dbase program, this field would be:
	 MS-DOS:Dbase-II.

      3) How much effort is needed to make it go.  If following the
	 directions is sufficient, the entry is `install'.

      4) This entry contains any tools, not normally available on
	 your system, which one must have in order to build or use
	 this item. All items which are in this section must also
	 have their own entries in the information directory.

DE <some text>

      This is a short description of the item. This should be kept
      brief; putting the man page here is not appropriate.  The text
      on these lines should be kept to less than 70 characters.
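
One way to stay under the limit is to let a program do the wrapping.
The following Python sketch uses the standard textwrap module.  It
assumes the limit applies to the text after the keyword, and the
function name is made up for this example.

```python
import textwrap


def de_lines(description, width=69):
    """Wrap a description into DE lines whose text is under 70 characters."""
    wrapped = textwrap.wrap(
        description,
        width=width,
        break_on_hyphens=False,  # keep words like what-have-you whole
    )
    return ["DE " + line for line in wrapped]
```

Each returned line carries the DE keyword followed by at most 69
characters of text.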

Here is an entry, suitable for the databases created through
comp.archives.

NM free-distribution-database
VR
AU bill@twwells.UUCP (T. William Wells)
MA bill@twwells.UUCP (T. William Wells)
EN bill@twwells.UUCP (T. William Wells) Fri Nov 11 00:56:16 EST 1988
TT Database of freely distributable, electronically accessible information.
KW database,public-domain
SY any;any;;
DE This database is constructed from the information that passes
DE through comp.archives.  It contains information on any software,
DE databases, documents, or what-have-you, that is both freely
DE distributable and available electronically.  "Freely
DE distributable" means that, if you have a copy of the item, you
DE can (at least) make exact copies and give them away, and you
DE don't have to tell the owner of the item (if any) that you have
DE done so.  "Electronically available" means that it is either
DE accessible through a publicly accessible network, or is available
DE by a means that does not involve paying a fee to the
DE distributor.  This information is provided as a free service and
DE there is *no one* guaranteeing that any of it is accurate or
DE useful.  Use it at your own risk.

---

The site index ties the previous two databases together.  This is the
format:

<name>;<version>;<archive>;<access tag>;<handle>;<size>;
    <date>;<tools>;<comments>

	The first two fields link this entry to an entry in the info
	database; they correspond to the NM and VR fields.  If this
	file is not listed in the database, these fields are blank.

	`Archive' is the name of the site, as recorded in the site
	database.

	`Access tag' is one of the access tags specified in the site
	entry.  Note that this is in the style of UNIX file names:
	wild cards are permitted.

	`handle' is used with the information from the site entry to
	construct the request from the archive.  For example, using
	uucp, if the site entry contained /usr/archives as the path
	to which file names are relative, and this field contains
	foobar.shar, then the path name you should use to get this
	item is /usr/archives/foobar.shar.

	`Size' is the size of the file, in K.

	`Date' is the date on which this entry was added to the database.
	This should be yymmdd.

	`Tools' is a list of programs needed to unarchive the file;
	each must be a name in the info database.  Standard system
	utilities are not listed.

	`Comments' is anything useful to add.

---

The DB: postings contain information to update the database.  The
update information starts with the first line beginning with an @ and
ends with a line containing @END. Additional information, not
intended to be part of the database, can be added before the first @
line or after the @END line.

Commands to add data look like:

      @ADD <database>

and the following data is what is to be added.  <database> is one of
the strings INFO, SITE, or INDEX. The new data is terminated by a
blank line. This blank line is required, no matter what the next
command is.

Commands to delete data look like:

      @DEL <database> <key>

The key depends on what is being deleted. Deletions from the
information database just use the item name. Deletions from the site
database use the site name. Deletions from the archive index use the
site name, the access method, and the access handle for the line to be
deleted. Semicolons are used to separate the key fields.

There is a special command to delete all index entries for a site;
its form is:

      @DELALL INDEX <site>
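
The update commands above could be read by something like the
following Python sketch.  It is not the program used to maintain the
database.  It assumes one command per line, that the @END line holds
nothing else, and that each @ADD is followed by a single block of data
ending at the required blank line.  The tuple layout it returns is
made up for this example.

```python
def parse_posting(text):
    """Return the update commands found in a DB: posting."""
    lines = text.splitlines()
    commands = []
    # Skip everything before the first line beginning with an @.
    i = next((n for n, line in enumerate(lines) if line.startswith("@")),
             len(lines))
    while i < len(lines):
        line = lines[i].strip()
        i += 1
        if line == "@END":
            break  # anything after @END is not part of the update
        if not line:
            continue
        verb, _, rest = line.partition(" ")
        rest = rest.strip()
        if verb == "@ADD":
            data = []
            # The new data runs up to the terminating blank line.
            while i < len(lines) and lines[i].strip():
                data.append(lines[i])
                i += 1
            commands.append(("ADD", rest, data))
        elif verb == "@DEL":
            database, _, key = rest.partition(" ")
            # Key fields are separated by semicolons.
            commands.append(("DEL", database, key.strip().split(";")))
        elif verb == "@DELALL":
            database, _, site = rest.partition(" ")
            commands.append(("DELALL", database, site.strip()))
        else:
            raise ValueError("unrecognized command: %r" % line)
    return commands
```

Text before the first @ line and after @END is ignored, so a posting
can carry ordinary remarks around the update.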

---
Bill
{uunet|novavax}!proxftl!twwells!bill

send comp.archives postings to twwells!comp-archives
send comp.archives related mail to twwells!comp-archives-request