[comp.sys.amiga] IFF archive proposal - update

FATQW@USU.BITNET (01/24/88)

                                  A R C
                              ARC/FLST/TREE
                      An IFF File Archive Format
                             by Bryan Ford
                                  plus
                          Lots of other people
                       Thanks for your comments!

   There are three sections in this document.  The first one describes the
ARC form, the second describes the FLST form, and the third one describes
the TREE form. Each is independent, but they make more sense when they are
put together.

   It is the almost unlimited nesting and expansion capability of the IFF
format that makes this file format possible.  Thanks EA!

   An archive file is made up of zero or more ARC chunks, with FLST chunks
as their "children." In other words, the ARC chunks are the tree and
branches, while the FLST chunks are the "leaves".  Also, a TREE may be
included in the beginning of an ARC to provide information about the
archive which would normally require seeking through the entire archive. In
addition, archives may be spread across multiple disks or other media for
backup purposes.

                    Section 1: FORM ARC - Archive

   The ARC form is a form for collecting more than one file into one file.
It can also specify subdirectories to be created before it is unarced, and
it can contain nested FORM ARCs as well as FLSTs.  A TREE chunk may be put
at the beginning of an ARC which would describe the content of the file
without having to scan the entire archive.

   SBDR - Subdirectory.  This chunk contains a string of characters
terminated by a null, specifying a subdirectory for this ARC to be unarced
into.  If the specified subdirectory does not already exist, the unarcing
program will create it in the directory specified by the user (or the
directory that a parent ARC was unarced into).  However, unarcing programs
should let the user override this and unarc straight into the directory the
user specified, ignoring this chunk.  This chunk may or may not be
included in the top ARC form, but it MUST be included in all sub-ARCs.

   ANAM - Archiver name.  This contains a null-terminated string telling
the name of the program that created this archive.  This chunk, if included
at all, should only be included in the top-level ARC chunk.  It is provided
only for the benefit of the user.  Unarchivers will simply ignore this
chunk, except to print its contents at the user's request.  Archivers may
or may not include this chunk, according to the programmer.
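
   The chunks above could be written out roughly as follows.  This is only
a sketch in Python, assuming the usual EA IFF 85 chunk layout (a 4-byte ID,
a big-endian longword length, the data, and one pad byte when the length is
odd) and assuming the three-letter form type is padded with a space
("ARC ").  The helper names chunk(), cstring() and form(), and the archiver
name "MyArc 0.1", are made up for the example.

```python
import struct


def chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Build one IFF chunk: ID, big-endian length, data, optional pad byte."""
    if len(chunk_id) != 4:
        raise ValueError("IFF chunk IDs are exactly four bytes")
    pad = b"\x00" if len(data) % 2 else b""  # keep the next chunk word-aligned
    return chunk_id + struct.pack(">I", len(data)) + data + pad


def cstring(text: str) -> bytes:
    """Encode a null-terminated string as used by SBDR and ANAM."""
    return text.encode("latin-1") + b"\x00"


def form(form_type: bytes, *chunks: bytes) -> bytes:
    """Wrap chunks in a FORM of the given (4-byte) type."""
    return chunk(b"FORM", form_type + b"".join(chunks))


# A top-level ARC naming its archiver and a subdirectory to unarc into.
archive = form(
    b"ARC ",                                  # assumed space-padded form type
    chunk(b"ANAM", cstring("MyArc 0.1")),     # hypothetical archiver name
    chunk(b"SBDR", cstring("games")),
)
```

   Note that the length field counts only the data, not the pad byte.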

   NEXT - Continuation chunk.  This chunk is provided in order to split
archives up over several volumes for backup purposes.  It is used in
combination with the PREV chunk.  It has a null-terminated string giving
the complete path to the next continuation file.  Each file is completely
separate, except they are linked with the NEXT and PREV chunks in the
FORM ARC, and the files may be split up between archives, making it
necessary to have access to more than one of the archives in order to
extract one file.

   PREV - Previous chunk.  This chunk points to the previous archive file
in the chain.  The NEXT chunk in the file pointed to by this PREV points to
this file.  In other words, the files form a doubly-linked list.
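
   An unarchiver might sanity-check such a chain before using it.  The
sketch below is only an illustration in Python: it assumes the NEXT and
PREV strings have already been read from each archive file into a
dictionary, and the function name order_chain() is made up.

```python
def order_chain(links):
    """Return archive paths in chain order.

    links maps each archive path to a (prev, next) pair taken from its
    PREV and NEXT chunks; None stands for a missing chunk.
    """
    heads = [path for path, (prev, _) in links.items() if prev is None]
    if len(heads) != 1:
        raise ValueError("expected exactly one archive without a PREV chunk")

    ordered, seen = [], set()
    current = heads[0]
    while current is not None:
        if current in seen:
            raise ValueError("chain loops back on itself at %r" % current)
        seen.add(current)
        ordered.append(current)
        nxt = links[current][1]
        if nxt is not None:
            if nxt not in links:
                raise ValueError("NEXT points to unknown archive %r" % nxt)
            # Doubly-linked: the NEXT file's PREV must point back here.
            if links[nxt][0] != current:
                raise ValueError("PREV of %r does not point back" % nxt)
        current = nxt

    if len(ordered) != len(links):
        raise ValueError("some archives are not reachable from the first")
    return ordered
```

   Walking from the one file without a PREV keeps the check simple; a real
unarchiver would probably also want to prompt for missing volumes.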

   FORM TREE - The archive tree.  There may be only one of these in any
archive, although it's not required.  If included, it should be in the
top-level ARC, and there should be one for every archive in a chain
(see NEXT), describing the files it contains.  They are provided to speed
up access to the files in the archive by providing useful information at
the beginning. This FORM must be before any ARCs or FLSTs. The format for
the TREE is described in section 3.  (Also from Miles Johnson)

   FORM ARC - A child ARC.  An ARC may contain other ARCs.  This is useful
for "sub-archives" which, when unarced, go into various directories
automatically.  A child ARC doesn't necessarily need a SBDR chunk, but it
makes little sense otherwise.

   FORM FLST - Files.  These chunks contain the actual files which make up
the archive. These are not necessary for an archive, and in some cases this
may be useful.  For one example, maybe an archive wants to create several
subdirectories which contain files, but no files in the root.  As another
example, you may want a child ARC to be completely empty except for a SBDR
chunk - for example a "saved games" directory without any saved games.  In
other words, this may be useful for creating directories to be used later
without putting any files in them.
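
   To show how nested ARCs, SBDR chunks and FLSTs fit together, here is a
rough Python sketch that walks the contents of a FORM ARC and lists where
each file would end up.  It assumes the usual IFF chunk layout (4-byte ID,
big-endian longword length, pad byte on odd lengths), space-padded "ARC "
form types, and "/" as the path separator.  The names iter_chunks(),
list_files() and chunk() are made up for the example.

```python
import posixpath
import struct


def chunk(chunk_id, data):
    """Build one IFF chunk (used here only to make test data)."""
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack(">I", len(data)) + data + pad


def iter_chunks(data):
    """Yield (id, payload) for each chunk stored back to back in data."""
    pos = 0
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from(">I", data, pos + 4)
        yield chunk_id, data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # skip the pad byte, if any


def list_files(arc_body, base=""):
    """List target paths for an ARC; arc_body excludes the 'ARC ' type."""
    directory, files = base, []
    for chunk_id, payload in iter_chunks(arc_body):
        if chunk_id == b"SBDR":
            name = payload.split(b"\x00", 1)[0].decode("latin-1")
            directory = posixpath.join(base, name)
        elif chunk_id == b"FORM" and payload[:4] == b"ARC ":
            # A child ARC unarcs relative to its parent's directory.
            files += list_files(payload[4:], directory)
        elif chunk_id == b"FORM" and payload[:4] == b"FLST":
            for sub_id, sub in iter_chunks(payload[4:]):
                if sub_id == b"SPEC":  # longword length, then the name
                    name = sub[4:].split(b"\x00", 1)[0].decode("latin-1")
                    files.append(posixpath.join(directory, name))
    return files


def flst(name, length):
    spec = struct.pack(">I", length) + name.encode("latin-1") + b"\x00"
    return chunk(b"FORM", b"FLST" + chunk(b"SPEC", spec))


child = chunk(b"FORM", b"ARC " + chunk(b"SBDR", b"saved\x00") + flst("game1", 5))
top = chunk(b"SBDR", b"games\x00") + flst("run", 3) + child
listing = list_files(top)
```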

                       Section 2: FORM FLST - File

   These FORMs contain files, and make up the archive "meat".  Each may
contain several chunks described below.  They may be split up into multiple
parts for data recovery purposes, as well as split over volumes to
facilitate making backups.

   SPEC - Filespec.  This chunk contains a longword length of the decoded
image, followed by a null-terminated filename. This is the only required
chunk in the FORM FLST.
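
   The SPEC data could be packed and unpacked along these lines.  This is a
sketch in Python, assuming the longword is big-endian (as elsewhere in IFF)
and the filename is plain 8-bit text; pack_spec() and unpack_spec() are
made-up names.

```python
import struct


def pack_spec(decoded_length, filename):
    """Longword length of the decoded file, then a null-terminated name."""
    return struct.pack(">I", decoded_length) + filename.encode("latin-1") + b"\x00"


def unpack_spec(data):
    """Return (decoded_length, filename) from the data of a SPEC chunk."""
    (decoded_length,) = struct.unpack_from(">I", data, 0)
    end = data.index(b"\x00", 4)  # the name ends at the first null
    return decoded_length, data[4:end].decode("latin-1")
```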

   CMNT - Comment.  This is a null-terminated string containing any comment
for the file.  It should normally be less than about 40 characters or
so.

   DATE - Last modification date.  Adding, extracting, or copying will not
change this date.  The file must have its data changed in order to have the
date changed.  The date is a standard AmigaDOS date consisting of three
longwords.  The first is the number of days elapsed since January 1, 1978.
The second longword is the number of minutes elapsed since midnight of that
day. The third longword is the number of "ticks" since the beginning of
that minute. A tick is 1/50th of a second.  Not required, but recommended.
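
   As a rough illustration, the three longwords can be converted to and
from a calendar date like this.  The sketch is in Python and assumes
big-endian longwords; unpack_date() and pack_date() are made-up names.

```python
import struct
from datetime import datetime, timedelta

AMIGA_EPOCH = datetime(1978, 1, 1)
TICKS_PER_SECOND = 50


def unpack_date(data):
    """Turn the 12 bytes of a DATE chunk into a datetime."""
    days, minutes, ticks = struct.unpack(">3I", data[:12])
    return AMIGA_EPOCH + timedelta(
        days=days, minutes=minutes, seconds=ticks / TICKS_PER_SECOND
    )


def pack_date(when):
    """Turn a datetime (on or after 1978-01-01) into DATE chunk data."""
    delta = when - AMIGA_EPOCH
    minutes, seconds = divmod(delta.seconds, 60)
    ticks = seconds * TICKS_PER_SECOND
    ticks += delta.microseconds * TICKS_PER_SECOND // 1_000_000
    return struct.pack(">3I", delta.days, minutes, ticks)
```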

   SPLT - File splitting.  This chunk is only required if the file has
been split in any way.  If it is not included, the default is zero for the
first and third words, and one for the second (middle). It contains three
words of data: the number of BODY chunks in previous archive files (denoted
by the PREV chunk in the top ARC), the number of BODY chunks in this
archive file, and the number of sections in archive files after this one
(denoted by the NEXT chunk in the top ARC). This is actually a replacement
for the older SECT chunk.  It is used to split files into smaller, more
manageable parts, and to split files between volumes, for backup purposes.
If the first and third words are zero, this file has been split up, but not
between volumes.  If one or both is nonzero, it means that this file has
been split up over one or more separate archive files, so to extract this
file an unarchiver will have to access more than one archive.  If the first
word is nonzero, this file must be the first in this archive.  If the last
word is nonzero, it must be the last in the archive.  Both are possible at
the same time.  They indicate how many BODY chunks come before this
archive, in it, and after it.  The unarchiver should use the NEXT and PREV
chunks in the top ARC to find the other parts to this file.  Note:  If a
file is split up over several archive files, the same header chunks MUST be
exactly duplicated in all the parts.  This includes SPEC, CMNT, PROT,
DATE, and all other header chunks except those used for controlling the
splitting.
   Splitting up files even in only one archive can be very valuable for
data recovery.  If part of a file is munged, the rest of the file may be
salvaged. For example, an archiving program can break up the file into
multiple compression chunks, and include a SPLT chunk with the number of
compression chunks. Each compression chunk will contain, say, one page of
data. Archivers should try to be intelligent when they split a file. For
example, if it's a text file, split it after each page break. If it's an
IFF file, split it between the chunks.  If one part is bad, the rest will
get put together, so the user might still get part of the file back.
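
   A reader for the SPLT data might look something like the sketch below.
It is written in Python and assumes "word" means a 16-bit big-endian
value, which this document does not actually state.  The names Split and
parse_splt() are made up.

```python
import struct
from typing import NamedTuple, Optional


class Split(NamedTuple):
    before: int  # BODY chunks in previous archive files (see PREV)
    here: int    # BODY chunks in this archive file
    after: int   # sections in archive files after this one (see NEXT)

    @property
    def spans_volumes(self):
        """True if other archive files are needed to extract the file."""
        return self.before != 0 or self.after != 0

    @property
    def total(self):
        return self.before + self.here + self.after


# Used when a FORM FLST has no SPLT chunk at all.
DEFAULT_SPLIT = Split(0, 1, 0)


def parse_splt(data: Optional[bytes]) -> Split:
    """Parse SPLT chunk data; pass None when the chunk is absent."""
    if data is None:
        return DEFAULT_SPLIT
    return Split(*struct.unpack(">3H", data[:6]))  # assumed 16-bit words
```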

   PROT - File protection.  Warning: Hot subject.  This chunk signifies to
an unarchiver that this file is password protected, and it should let the
user enter a password and then decode the file according to the password
entered.  The password is not actually stored anywhere in the file, so
hackers can't write programs which simply look through the archive and
print the passwords.  Crackers have to "crack".  Anyway, this contains one
longword, which is an identifier for the type of encoding used, followed
optionally by data which applies to the particular encoding method
selected.  When the user enters a password, the unarchiver will decode the
file according to the password entered - if the password is wrong, the file
will simply be decoded as garbage.  Passwords must be mapped to uppercase
by archivers and unarchivers.  As of yet, I don't know of any specific
methods of encrypting files, so it's up to you guys to fill in the blanks!

   CRC - CRC check.  This chunk was modified from its original definition
to accommodate multiple program sections.  The chunk contains as many words
of data as there are BODY chunks in this archive file - one CRC for each
section.  Note that this includes only the BODY chunks included within
this archive file, and not any that are in other archives, if the file is
split between volumes.  If there are too many CRC words, an unarchiver will
ignore the rest. If there are too few, the unarchiver will check only the
sections with CRCs supplied, and possibly give a warning to the user. If
there is no CRC chunk, no checking will be done.
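
   The matching rules above might be coded roughly as follows.  This
Python sketch makes several assumptions the document does not settle: that
each CRC is a 16-bit big-endian word, that the CRC is the CCITT variant
computed by Python's binascii.crc_hqx(), and that it covers whatever bytes
are passed in.  The names crc16() and check_bodies() are made up.

```python
import binascii
import struct


def crc16(data):
    """Assumed CRC: 16-bit CCITT with a zero start value."""
    return binascii.crc_hqx(data, 0)


def check_bodies(crc_data, bodies):
    """Compare stored CRCs against BODY sections.

    Returns (results, warnings).  Each result is True, False, or None
    when no CRC was supplied for that section.
    """
    if crc_data is None:          # no CRC chunk: no checking at all
        return [None] * len(bodies), []

    count = len(crc_data) // 2
    crcs = struct.unpack(">%dH" % count, crc_data[:count * 2])

    results = []
    for index, body in enumerate(bodies):
        if index < len(crcs):
            results.append(crc16(body) == crcs[index])
        else:
            results.append(None)  # too few CRCs: leave this one unchecked

    warnings = []
    if len(crcs) < len(bodies):
        warnings.append(
            "only %d of %d sections have a CRC" % (len(crcs), len(bodies))
        )
    # Extra CRC words beyond len(bodies) are simply ignored.
    return results, warnings
```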

   AICN - Amiga Icon.  This chunk is specifically for the Amiga.  It stores
icon information for the file.  It has exactly the same format as the .info
files.  Basically, its purpose is to package icons with programs without
having to have two entries with every file.  The difference between
this and storing raw .info files would be so the user, when he gets
an archive listing, gets something like "Test (with icon)", instead
of "Test" and "Test.info".

   BODY - These chunks comprise the actual data of the file, or the data
of one section of the file. There will be only one BODY chunk unless the
file is broken up into sections (see SPLT for details).  This chunk always
contains one longword at the beginning, which is an identifier for the
compression format, and the compressed bytes of the file.  The actual data
of the file will depend on the compression used. The predefined compression
algorithms are defined below.

   BODY/STOR - Storage without compression.  This is usually used for very
small files which would not gain anything in compression.  The chunk's data
is an exact duplicate of what will go into the file.
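
   Reading a BODY chunk could be sketched like this.  The Python below
assumes the longword identifier is the four ASCII characters of the method
name (STOR, PACK, LZIV, HUFF), which the BODY/STOR notation suggests but
does not state outright.  Only STOR is decoded; split_body() and
decode_body() are made-up names.

```python
def split_body(data):
    """Split BODY chunk data into (compression identifier, payload)."""
    if len(data) < 4:
        raise ValueError("BODY chunk is missing its compression identifier")
    return data[:4], data[4:]


def decode_body(data):
    """Return the decoded bytes of one BODY section."""
    method, payload = split_body(data)
    if method == b"STOR":
        return payload  # stored: the payload is the file data itself
    raise NotImplementedError("no decoder for %r in this sketch" % method)
```

   Raising on an unknown identifier lets an unarchiver skip the file and
carry on, rather than write garbage.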

   BODY/PACK - Packing.  This algorithm is the simplest, and simply sticks
repetitious bytes together.
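
   The exact PACK byte layout is not given here, so the following Python
is only a sketch of the general idea (run-length encoding), using a
made-up layout of (count, value) byte pairs.  It should not be taken as
the real BODY/PACK format; pack_runs() and unpack_runs() are made-up
names.

```python
def pack_runs(data):
    """Encode data as (count, value) pairs, with count in 1..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def unpack_runs(packed):
    """Reverse pack_runs()."""
    if len(packed) % 2:
        raise ValueError("packed data must be whole (count, value) pairs")
    out = bytearray()
    for k in range(0, len(packed), 2):
        out += bytes([packed[k + 1]]) * packed[k]
    return bytes(out)
```

   With this layout, data with no repeats doubles in size, which is one
reason an archiver might fall back to STOR.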

   BODY/LZIV - Lempel-Ziv encoding. This contains one byte which tells
the "number of bits" used, followed by the data.  Typically 12 (crunching)
or 13 (squashing) bits. As of yet, I don't have the docs for this format,
but as soon as I get them, I'll include them in a different doc file.

   BODY/HUFF - Huffman encoding.  I don't have any docs for this one
either, so I welcome any mail containing docs on this.

   When better compression algorithms come out, they may be added to the
BODY.  However, these will NOT be forward compatible - programs which
support them will not be compatible with programs which don't.

   Also, archiving programs aren't required to analyze files to make sure
they're getting the maximum efficiency.  A user may want to just always use
12-bit Lempel-Ziv encoding, since it's the most often used.  Archivers
probably should have an option to disable analyzing the file.

             Section 3: FORM TREE - Directory and file trees

   This form may appear only in a top-level ARC.  When used in an archive,
it describes the content of that archive, and makes scanning the archive
much faster.  It may also be useful alone for creating directory trees. The
chunks that apply specifically to ARCs are noted.

   DRNM - Directory name.  This contains a null-terminated string giving
the name of this volume or directory.  It must appear before any other
chunks in the FORM TREE.  It is required for all sub-trees, but not
required (although recommended) for the top-level TREE.

   SPEC - Filespec.  This chunk contains one longword containing the
length of the file, followed by a null-terminated string which names a file
in this directory or volume.  It may be followed by any of the following
"modifier" chunks which describe the file in more detail.

   POS - Position.  This chunk is valid only when used in archive files. It
contains one longword of data, which is the absolute offset from the
beginning of this archive file at which the FORM FLST for this file will be
found.  This also may be used after subdirectory descriptions (FORM TREEs),
where it tells where in the archive file the FORM ARC will be found.  Note that
if there is any suspicion of data corruption, this should NOT be used by an
unarchiver, since it uses absolute references into the file.  Also, the
user should have the option to disable both using these chunks when found,
and writing them to files.
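
   Because POS is an absolute offset, an unarchiver can check what it
finds there before trusting it.  The Python sketch below assumes the usual
IFF FORM header (the ID "FORM", a big-endian longword length, then the
form type) and returns None when the offset does not land on a FORM FLST,
so the caller can fall back to scanning.  read_flst_at() is a made-up
name.

```python
import io
import struct


def read_flst_at(archive, offset):
    """Return the contents of the FORM FLST at offset, or None if absent.

    archive is a seekable binary file object; the returned bytes exclude
    the 12-byte 'FORM', length, 'FLST' header.
    """
    archive.seek(offset)
    header = archive.read(12)
    if len(header) < 12 or header[:4] != b"FORM" or header[8:12] != b"FLST":
        return None  # stale or corrupt POS: scan the archive instead
    (size,) = struct.unpack(">I", header[4:8])
    if size < 4:
        return None
    body = archive.read(size - 4)  # size includes the 4-byte form type
    if len(body) != size - 4:
        return None  # truncated file
    return body


# A tiny in-memory "archive" with a FORM FLST starting at offset 4.
sample = b"XXXX" + b"FORM" + struct.pack(">I", 8) + b"FLST" + b"abcd"
found = read_flst_at(io.BytesIO(sample), 4)
missed = read_flst_at(io.BytesIO(sample), 0)
```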

   FORM TREE - Subdirectories.  This describes a subdirectory within
this directory.  The name can be gotten from the DRNM chunk within this
FORM.
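
   As an illustration, a TREE could be built from a nested description of
a directory like this.  The Python sketch assumes the usual IFF chunk
layout (4-byte ID, big-endian longword length, pad byte on odd lengths)
and leaves out POS chunks.  The names chunk(), cstring() and tree() are
made up, as is the convention of using a dict for a directory and an
integer length for a file.

```python
import struct


def chunk(chunk_id, data):
    """Build one IFF chunk with a pad byte when the data length is odd."""
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack(">I", len(data)) + data + pad


def cstring(text):
    return text.encode("latin-1") + b"\x00"


def tree(name, entries):
    """Build a FORM TREE for a directory.

    entries maps a name to either an int (a file's length) or another
    dict (a subdirectory).
    """
    parts = [chunk(b"DRNM", cstring(name))]  # DRNM comes before anything else
    for entry_name, value in entries.items():
        if isinstance(value, dict):
            parts.append(tree(entry_name, value))  # nested FORM TREE
        else:
            spec = struct.pack(">I", value) + cstring(entry_name)
            parts.append(chunk(b"SPEC", spec))
    return chunk(b"FORM", b"TREE" + b"".join(parts))
```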

   One final note: there is no requirement to sort archived files in any
way, although archivers may want to sort them for the sake of the user.
   Although this document is not copyrighted or anything, please don't
redistribute it very much.  This is because it's only a draft, and it will
probably get changed, and we want EVERYONE to have the same thing.  So if
you decide to send it to local BBSs or something, please take the
responsibility of updating them too.  Thanks.
   Please feel free to email suggestions for this file, as well as post to
Usenet.  I don't have access to BIX, so although there is no requirement
that you keep it to Usenet, I won't be able to respond on BIX, unless
somebody does some cross-posting for me.  Oh, and if anybody has the docs
for Lempel-Ziv and Huffman encoding, or any other "interesting" formats,
I'd appreciate it in my mailbox (address below).

                             History

  date         author                      changes
-------- ------------------ ---------------------------------------------
  ????       Bryan Ford     Gave birth to this file
01/08/88     Bryan Ford     +ANAM SECT, ~CRC
01/17/88     Bryan Ford     BODY=STOR+CRNC+PACK+SQEZ+SQSH, :-)8
01/20/88     Bryan Ford     +FORM_TREE NEXT PREV CMNT DATE, SPLT=~SECT, :-)8
01/22/88     Bryan Ford     SPEC=NAME+LEN, -LEVL, +PROT, BODY_HUFF=BODY_SQEZ
                            BODY_LZIV=BODY_CRNC+BODY_SQSH, +AICN
                                                         .
If you want to know what this means, it's a shorthand   /|\
for English which I created - I can send mail to         |
curious people.   -Bryan                                 |

                              THE END

       Bryan Ford                  ///// A computer does what \\\\\
Snail: 1790 East 1400 North       ///// you tell it to do, not \\\\\
       Logan, UT 84321        \\\XX///  what you want it to do. \\\XX///
Email: FATQW@USU.BITNET        \XXXX/ Murphy's Law Calender 1986 \XXXX/

bryce@hoser.berkeley.edu (Bryce Nesbitt) (01/24/88)

In article <8801240527.AA15462@jade.berkeley.edu> FATQW@USU.BITNET writes:
>
>                      ...An IFF File Archive Format...
>
>...AICN - Amiga Icon.  This chunk is specifically for the Amiga.  It stores
>icon information for the file.  It has exactly the same format as the .info
>files....

Please no!  It is the same format as a "Disk Object", as retrieved by
the "GetDiskObject()" library call.  The de-arcer must put this back with the
"PutDiskObject()" library call.

".info" files are merely a side effect of the current Workbench implementation.
The only defined interface for making icons is "PutDiskObject()".

This is all in the "icon.library".   See the "Workbench" chapter in the RKM.


|\ /|  . Ack! (NAK, SOH, EOT)
{o O} . bryce@hoser.berkeley.EDU -or- ucbvax!hoser!bryce (or try "cogsci")
 (")
  U	"As an engineer, I only set the value of a product... not the cost."
	-Bryce Nesbitt