[comp.sys.amiga] IFF archive proposal - Next!

FATQW@USU.BITNET (02/03/88)
                                  A R C
                              ARC/FLST/TREE
                       An IFF File Archive Format
                         prepared by Bryan Ford

   There are three sections in this document.  The first one describes the
ARC form, the second describes the FLST form, and the third one describes
the TREE form. Each is independent, but they make more sense when they
are put together.

   It is the almost unlimited nesting and expansion capability of the IFF
format that makes this file format possible.  Thanks EA!

   An archive file is made up of zero or more ARC chunks, with FLST chunks
as their "children." In other words, the ARC chunks are the tree and
branches, while the FLST chunks are the "leaves".  Also, a TREE may be
included in the beginning of an ARC to provide information about the
archive which would normally require seeking through the entire archive. In
addition, archives may be spread across multiple disks or other media for
backup purposes.
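
   As a rough illustration of the nesting described above, here is a
minimal Python sketch that walks generic IFF chunks and prints an outline.
It assumes only the standard IFF layout (a four-byte ID, a big-endian
32-bit length, and a pad byte after odd-sized chunks). It also assumes that
three-letter names such as ARC are stored padded with a trailing space.
The names iter_chunks, outline and make_chunk are made up for this example
and are not part of the proposal.

```python
import struct

def iter_chunks(data):
    """Yield (id, payload) for each IFF chunk found in a byte string."""
    pos = 0
    while pos + 8 <= len(data):
        # Standard IFF header: 4-byte ID, then a big-endian 32-bit length.
        chunk_id, size = struct.unpack_from(">4sL", data, pos)
        yield chunk_id, data[pos + 8 : pos + 8 + size]
        pos += 8 + size + (size & 1)  # odd-sized chunks carry one pad byte

def outline(data, depth=0):
    """Return an indented listing of the chunks, descending into FORMs."""
    lines = []
    for chunk_id, payload in iter_chunks(data):
        name = chunk_id.decode("ascii").rstrip()
        if chunk_id == b"FORM":
            # The first four payload bytes of a FORM give its type.
            form_type = payload[:4].decode("ascii").rstrip()
            lines.append("  " * depth + "FORM " + form_type)
            lines.extend(outline(payload[4:], depth + 1))
        else:
            lines.append("  " * depth + name)
    return lines

def make_chunk(chunk_id, payload):
    """Build one chunk for testing (hypothetical helper, not in the spec)."""
    pad = b"\0" if len(payload) & 1 else b""
    return chunk_id + struct.pack(">L", len(payload)) + payload + pad
```

   Feeding outline() a FORM ARC that holds an ANAM chunk and one FORM FLST
gives a listing with the FLST indented under the ARC, which is the
tree-and-leaves picture in miniature.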

                    Section 1: FORM ARC - Archive

   The ARC form is a form for collecting more than one file into one file.
It can also specify subdirectories to be created before it is unarced, and
it can contain nested FORM ARCs as well as FLSTs.  A TREE chunk may be put
at the beginning of an ARC which would describe the content of the file
without having to scan the entire archive.

   SBDR - Subdirectory.  This chunk contains the same information as the
SPEC chunk in the FORM FLST, specifying a subdirectory for this ARC to be
unarced into. If the specified subdirectory does not already exist, the
unarcing program will create it in the directory specified by the user (or
the directory that a parent ARC was unarced into).  However, unarcing
programs should be able to override this and unarc directly into the
user-specified directory, ignoring this chunk. This chunk may or may not
be included in the top ARC form, but it MUST be included in all sub-ARCs.

   SORT - Sorting.  This chunk tells the unarchiver how the files are
sorted. It contains one word of data specifying the sorting used: 0=no
sorting, 1=filename, 2=date, 3=size, 4=sorting algorithm used, 5=percentage
of space freed by compression.  If there is no SORT chunk present then it
is assumed that the file is not sorted.  This must ALWAYS reflect the
truth. For example, if an unarchiver finds an archive with a SORT of 1, and
it's looking for A*.*, then it need not look any further once it reaches a
name beginning with B.  Also remember that this sorting applies to this ARC and
all sub-ARCs within this one, but no others.
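
   The early exit that a truthful SORT allows might look like the sketch
below. It is only an illustration: the function name find_by_prefix is made
up, the pattern is reduced to a plain name prefix, and it assumes that
"sorted by filename" means case-insensitive character order, which this
draft does not actually pin down.

```python
SORT_NONE = 0      # no SORT chunk, or SORT of 0
SORT_FILENAME = 1  # SORT of 1

def find_by_prefix(names, prefix, sort_code=SORT_NONE):
    """Return (matching names, number of entries examined)."""
    matches = []
    examined = 0
    prefix = prefix.upper()
    for name in names:
        examined += 1
        key = name.upper()
        if key.startswith(prefix):
            matches.append(name)
        elif sort_code == SORT_FILENAME and key > prefix:
            # Sorted by name and already past the prefix: nothing that
            # follows can match, so stop scanning.
            break
    return matches, examined
```

   With a SORT of 1 the scan stops at the first name past the prefix. With
no SORT chunk every entry has to be examined.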

   ANAM - Archiver name.  This contains a null-terminated string telling
the name of the program that created this archive.  This chunk, if included
at all, should only be included in the top-level ARC chunk.  It is provided
only for the benefit of the user.  Unarchivers will simply ignore this
chunk, except to print its contents at the user's request.  Archivers may
or may not include this chunk, according to the programmer.

   NEXT - Continuation chunk.  This chunk is provided in order to split
archives up over several volumes for backup purposes.  It is used in
combination with the PREV chunk.  It has a null-terminated string giving
the complete path to the next continuation file.  Each archive file is
complete in itself, except that the files are linked with the NEXT and
PREV chunks in the FORM ARC.  However, an archived file may be split up
between archives, making it necessary to have access to more than one of
the archives in order to extract that one file.

   PREV - Previous chunk.  This chunk points to the previous archive file
in the chain.  The NEXT chunk in the file pointed to by this PREV points to
this file.  In other words, the files form a doubly-linked list.
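
   To make the doubly-linked list concrete, here is a small sketch that
orders a chain of archive files starting from any member. It assumes the
NEXT and PREV strings have already been read out of each file's top-level
ARC into a dictionary. The function name ordered_chain and the dictionary
layout are invented for this example.

```python
def ordered_chain(headers, start):
    """Return the paths of all archives in the chain, first to last.

    headers maps a path to a dict with optional "PREV" and "NEXT" paths,
    standing in for the chunks found in that file's top-level ARC.
    """
    # Follow PREV links back to the first archive in the chain.
    first = start
    seen = {first}
    while headers[first].get("PREV"):
        first = headers[first]["PREV"]
        if first in seen:
            raise ValueError("PREV links form a loop")
        seen.add(first)

    # Follow NEXT links forward, checking that each PREV points back.
    chain = [first]
    while headers[chain[-1]].get("NEXT"):
        nxt = headers[chain[-1]]["NEXT"]
        if headers[nxt].get("PREV") != chain[-1]:
            raise ValueError("NEXT/PREV mismatch at " + nxt)
        chain.append(nxt)
    return chain
```

   The forward pass refuses a chain whose PREV does not point back at the
file whose NEXT led there, which is the property described above.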

   FORM TREE - The archive tree.  There may be only one of these in any
archive, although it's not required.  If included, it should be in the
top-level ARC, and there should be one for every archive in a chain
(see NEXT), describing the files it contains.  They are provided to speed
up access to the files in the archive by providing useful information at
the beginning. This FORM must be before any ARCs or FLSTs. The format for
the TREE is described in section 3.  (Also from Miles Johnson)

   FORM ARC - A child ARC.  An ARC may contain other ARCs.  This is useful
for "sub-archives" which, when unarced, go into various directories
automatically.  A child ARC doesn't necessarily need a SBDR chunk, but it
makes little sense otherwise.

   FORM FLST - Files.  These chunks contain the actual files which make up
the archive. An archive need not contain any, and in some cases leaving
them out is useful.  For one example, an archive may need to create several
subdirectories which contain files, but put no files in the root.  As another
example, you may want a child ARC to be completely empty except for a SBDR
chunk - for example a "saved games" directory without any saved games.  In
other words, an ARC without FLSTs is a way of creating directories to be
used later without putting any files in them.

                       Section 2: FORM FLST - File

   These FORMs contain files, and make up the archive "meat".  Each may
contain several chunks described below.  They may be split up into multiple
parts for data recovery purposes, as well as split over volumes to
facilitate making backups.

   SPEC - Filespec.  This chunk contains one longword length of the decoded
image, a three longword DateStamp (days, mins, ticks), a longword for the
protection bits, and a null-terminated filename. This is the only required
chunk in the FORM FLST. The date is not changed by archiving or unarcing a
file; only modifying the file changes the date.
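
   A possible reading of the SPEC layout is sketched below in Python. It
assumes big-endian longwords as in the rest of IFF, a filename in ISO
Latin-1, and that the three date longwords follow the usual AmigaDOS
DateStamp convention (days since 1 January 1978, minutes since midnight,
ticks of 1/50 second within the minute). The names parse_spec and
datestamp_to_datetime are made up for this example.

```python
import struct
from datetime import datetime, timedelta

def parse_spec(payload):
    """Split a SPEC payload into its fields (layout as assumed above)."""
    # length, three DateStamp longwords, protection bits: 20 bytes in all
    length, days, mins, ticks, protection = struct.unpack_from(">L3lL", payload, 0)
    end = payload.index(b"\0", 20)  # the filename is null-terminated
    return {
        "length": length,
        "date": (days, mins, ticks),
        "protection": protection,
        "name": payload[20:end].decode("latin-1"),
    }

def datestamp_to_datetime(days, mins, ticks):
    """Convert a DateStamp, assuming the AmigaDOS epoch and 50 ticks/sec."""
    return datetime(1978, 1, 1) + timedelta(days=days, minutes=mins,
                                            seconds=ticks / 50)
```

   An archiver would build the same 20-byte header followed by the name
and a terminating null.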

   CMNT - Comment.  This is a null-terminated string containing any comment
for the file.  It should never be null - if the file doesn't have a
comment, then this block shouldn't appear.

   SPLT - File splitting.  This chunk is only required if the file has
been split in any way.  If it is not included, the default is zero for the
first and third words, and one for the second (middle). It contains three
words of data: the number of BODY chunks in previous archive files (denoted
by the PREV chunk in the top ARC), the number of BODY chunks in this
archive file, and the number of sections in archive files after this one
(denoted by the NEXT chunk in the top ARC). This is actually a replacement
for the older SECT chunk.  It is used to split files into smaller, more
manageable parts, and to split files between volumes, for backup purposes.
If the first and third words are zero, this file has been split up, but not
between volumes.  If one or both is nonzero, it means that this file has
been split up over one or more separate archive files, so to extract this
file an unarchiver will have to access more than one archive.  If the first
word is nonzero, this file must be the first in this archive.  If the last
word is nonzero, it must be the last in the archive.  Both are possible at
the same time.  They indicate how many BODY chunks come before this
archive, in it, and after it.  The unarchiver should use the NEXT and PREV
chunks in the top ARC to find the other parts to this file.  Note:  If a
file is split up over several archive files, the same header chunks MUST be
exactly duplicated in all the parts.  This includes NAME, LEN, CMNT, PSWD,
DATE, and all other header chunks except those used for controlling the
splitting.
   Splitting up files even in only one archive can be very valuable for
data recovery.  If part of a file is munged, the rest of the file may be
salvaged. For example, an archiving program can break up the file into
multiple compression chunks, and include a SPLT chunk with the number of
compression chunks. Each compression chunk will contain, say, one page of
data. Archivers should try to be intelligent when they split a file. For
example, if it's a text file, split it after each page break. If it's an
IFF file, split it between the chunks.  If one part is bad, the rest will
get put together, so the user might still get part of the file back.
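
   The rules for the three SPLT words can be summed up in a short sketch.
It assumes a "word" is 16 bits, big-endian, in the Amiga sense. The names
parse_splt and describe_split are hypothetical.

```python
import struct

DEFAULT_SPLT = (0, 1, 0)  # used when no SPLT chunk is present

def parse_splt(payload=None):
    """Return (before, here, after) BODY counts from a SPLT payload."""
    if payload is None:
        return DEFAULT_SPLT
    return struct.unpack_from(">3H", payload, 0)  # three 16-bit words (assumed)

def describe_split(splt):
    """Interpret the three counts according to the rules above."""
    before, here, after = splt
    return {
        "total_bodies": before + here + after,
        "spans_volumes": before != 0 or after != 0,
        "must_be_first_in_archive": before != 0,
        "must_be_last_in_archive": after != 0,
    }
```

   For example, the words 2, 3, 0 describe a file whose first two BODY
chunks sit in earlier archive files and whose last three are in this one.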

   PROT - File protection.  Warning: Hot subject.  This chunk signifies to
an unarchiver that this file is password protected, and it should let the
user enter a password and then decode the file according to the password
entered.  The password is not actually stored anywhere in the file, so
hackers can't write programs which simply look through the archive and
print the passwords.  Crackers have to "crack".  Anyway, this contains one
longword, which is an identifier for the type of encoding used, followed
optionally by data which applies to the particular encoding method
selected.  When the user enters a password, the unarchiver will decode the
file according to the password entered - if the password is wrong, the file
will simply be decoded as garbage.  Passwords must be mapped to uppercase
by archivers and unarchivers.  As of yet, I don't know of any specific
methods of encrypting files, so it's up to you guys to fill in the blanks!

   CRC - CRC check.  This chunk was modified from its original definition
to accommodate multiple program sections.  The chunk contains as many words
of data as there are BODY chunks in this archive file - one CRC for each
section.  Note that this includes only the BODY chunks included within
this archive file, and not any that are in other archives, if the file is
split between volumes.  If there are too many CRC words, an unarchiver will
ignore the rest. If there are too few, the unarchiver will check only the
sections with CRCs supplied, and possibly give a warning to the user. If
there is no CRC chunk, no checking will be done.
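
   The matching of CRC words to BODY chunks could be handled as in the
sketch below. This draft does not say which CRC is used or exactly which
bytes it covers, so the example simply assumes a 16-bit CRC-CCITT (Python's
binascii.crc_hqx) taken over each BODY's bytes. The name check_bodies is
made up.

```python
import binascii

def crc16(data):
    """16-bit CRC-CCITT; an assumed stand-in for the unspecified CRC."""
    return binascii.crc_hqx(data, 0)

def check_bodies(bodies, crcs):
    """Return (list of pass/fail results, number of unchecked bodies).

    zip() stops at the shorter list, so surplus CRC words are ignored
    and surplus BODY chunks are simply left unchecked.
    """
    results = [crc16(body) == crc for body, crc in zip(bodies, crcs)]
    unchecked = max(0, len(bodies) - len(crcs))
    return results, unchecked
```

   A nonzero unchecked count is where an unarchiver might warn the user.
An empty CRC list, standing for a missing CRC chunk, checks nothing.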

   FORM ILBM - Icons.  This is a standard ILBM picture chunk describing an
icon for this file, if any.  This form may also have an ARC-specific
property chunk, SPEC, which has exactly the same format as the SPEC chunk
described above, but without the filename.  It contains the date,
protection, etc. for the icon file.  Also, a CMNT chunk may appear in the
icon chunk which is the comment for the .info file.

   BODY - These chunks comprise the actual data of the file, or the data
of one section of the file. There will be only one BODY chunk unless the
file is broken up into sections (see SPLT for details).  This chunk always
contains one longword at the beginning, which is an identifier for the
compression format, and the compressed bytes of the file.  The actual data
of the file will depend on the compression used. The predefined compression
algorithms are defined below.

   BODY/STOR - Storage without compression.  This is usually used for very
small files which would not gain anything in compression.  The chunk's data
is an exact duplicate of what will go into the file.

   BODY/PACK - Packing.  This algorithm is the simplest, and simply sticks
repetitious bytes together.

   BODY/LZIV - Lempel-Ziv encoding. This contains one byte which tells
the "number of bits" used, followed by the data.  Typically 12 (crunching)
or 13 (squashing) bits. As of yet, I don't have the docs for this format,
but as soon as I get them, I'll include them in a different doc file.

   BODY/HUFF - Huffman encoding.  I don't have any docs for this one
either, so I welcome any mail containing docs on this.

   BODY/TPAK - Text packing.  This is my own format which I'm still working
on.  It will be specifically for documents and other human-type text.  It
will be able to crunch large documents down by a huge amount, but small
ones won't do so well.  It only works with text, as the 7th bit gets
stripped, and it won't handle "words" more than 255 characters in length.
Comments welcome as soon as I finish and post it, but not until then. :-)

   When better compression algorithms come out, they may be added to the
BODY specification.  However, remember that old archivers won't be able to
use new compression formats.
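
   One way an unarchiver might cope with compression formats it does not
know is a lookup table keyed on the identifier longword. The sketch below
assumes the identifier is the four ASCII characters of the format name
(for example STOR), which this draft implies but does not state. Only STOR
is filled in, since the other formats are not yet documented here. The
names DECODERS and decode_body are hypothetical.

```python
# Map compression identifiers to decoder functions.  New formats would
# be added here as they are defined.
DECODERS = {
    b"STOR": lambda data: bytes(data),  # stored: data is the file contents
}

def decode_body(payload):
    """Decode one BODY payload, or fail clearly on an unknown format."""
    method, data = payload[:4], payload[4:]
    decoder = DECODERS.get(method)
    if decoder is None:
        raise ValueError("unsupported compression format: %r" % method)
    return decoder(data)
```

   An old unarchiver built this way reports a new format by name instead
of writing out garbage.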

   Also, archiving programs aren't required to analyze files to make sure
they're getting the maximum efficiency.  A user may want to just always use
12-bit Lempel-Ziv encoding, since it's the most often used.  Archivers
probably should have an option to disable analyzing the file.

             Section 3: FORM TREE - Directory and file trees

   This form may appear only in a top-level ARC.  When used in an archive,
it describes the content of that archive, and makes scanning the archive
much faster.  It may also be useful alone for creating directory trees. The
chunks that apply specifically to ARCs are noted.

   DRNM - Directory name.  This contains a null-terminated string giving
the name of this volume or directory.  It must appear before any other
chunks in the FORM TREE.  It is required for all sub-trees, but not
required (although recommended) for the top-level TREE.

   FILE - Filespec.  This chunk contains one longword containing the length
of the file followed by a null-terminated string which names a file in this
directory or volume.  It may be followed by any of the following "modifier"
chunks which describe the file in more detail.

   POS - Position.  This chunk is valid only when used in archive files. It
contains one longword of data, which is the absolute offset from the
beginning of this archive file at which the FORM FLST for this file will be
found.  This also may be used after subdirectory descriptions (FORM TREEs),
where it tells where in the archive file the FORM ARC will be found.  Note that
if there is any suspicion of data corruption, this should NOT be used by an
unarchiver, since it uses absolute references into the file.  Also, the
user should have the option to disable both using these chunks when found,
and writing them to files.
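
   Because POS is an absolute reference, a cautious unarchiver can at
least check that the expected FORM really starts at that offset before
trusting it. The sketch below assumes the offset points at the "FORM" ID
itself and that the type ARC is stored padded with a space. The name
form_at is made up.

```python
import struct

def form_at(stream, offset, expected_type):
    """Return True if a FORM of the expected type starts at offset."""
    stream.seek(offset)
    header = stream.read(12)  # "FORM", 32-bit length, 4-byte form type
    if len(header) < 12:
        return False
    chunk_id, _size, form_type = struct.unpack(">4sL4s", header)
    return chunk_id == b"FORM" and form_type == expected_type
```

   If the check fails, the unarchiver can fall back to scanning the
archive chunk by chunk, as it would when POS chunks are disabled.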

   FORM TREE - Subdirectories.  This describes a subdirectory within
this directory.  The name can be gotten from the DRNM chunk within this
FORM.

   Although this document is not copyrighted or anything, please don't
redistribute it very much.  This is because it's only a draft, and it will
probably get changed, and we want EVERYONE to have the same thing.  So if
you decide to send it to local BBSs or something, please take the
responsibility of updating them too.  Thanks.

   Please feel free to email suggestions for this file.  If someone would
like to volunteer to cross-post this discussion to BIX, I would appreciate
it, since I don't have access to BIX.

   And if anybody has the docs for Lempel-Ziv and Huffman encoding, or any
other "interesting" formats, I'd appreciate it in my mailbox (address
below).

                             History

  date         author                      changes
-------- ------------------ ---------------------------------------------
  ????       Bryan Ford     Gave birth to this file
01/08/88     Bryan Ford     +ANAM SECT, ~CRC
01/17/88     Bryan Ford     BODY=STOR+CRNC+PACK+SQEZ+SQSH, :-)8
01/20/88     Bryan Ford     +FORM_TREE NEXT PREV CMNT DATE, SPLT=~SECT, :-)8
01/22/88     Bryan Ford     SPEC=NAME+LEN, -LEVL, +PROT, BODY_HUFF=BODY_SQEZ
                            BODY_LZIV=BODY_CRNC+BODY_SQSH, +AICN
01/26/88     Bryan Ford     SPEC=SPEC+DATE, ~AICN
02/02/88     Bryan Ford     ~SPEC, ~SBDR, ~FORM_ILBM=~AICN, +BODY_TPAK,
                            +SORT, :-)8
                                                         .
If you want to know what this means, it's a shorthand   /|\
for English which I created - I can send mail to         |
curious people.   -Bryan                                 |

                              THE END

       Bryan Ford                  ///// A computer does what \\\\\
Snail: 1790 East 1400 North       ///// you tell it to do, not \\\\\
       Logan, UT 84321        \\\XX///  what you want it to do. \\\XX///
Email: USU@FATQW.BITNET        \XXXX/ Murphy's Law Calendar 1986 \XXXX/