[comp.sys.atari.8bit] ARC file format?

jrd@STONY-BROOK.SCRC.SYMBOLICS.COM (John R. Dunning) (09/26/87)

Does anyone know the format of .ARC files, as hacked by the ARCX and ARC
that were posted a while ago?  I've gotten sick of their slowness and
low memory requirements, and am contemplating whacking together some new
ones.  I've looked at dumps of the binaries, but it's not obvious how
the files are constructed (not surprising, if they're Huffman encoded).
Alternatively, does anyone know how to lay hands on the source for ARCX
and ARC?  Failing that, does anyone know how to contact the authors?
Thanks for any info.

knutsen@aramis.rutgers.edu (Mark Knutsen) (09/27/87)

In article <870925172005.1.JRD@GRACKLE.SCRC.Symbolics.COM> jrd@STONY-BROOK.SCRC.SYMBOLICS.COM (John R. Dunning) writes:

> Does anyone know the format of .ARC files, as hacked by the ARCX and ARC
> that were posted a while ago?  
>   Failing that, does anyone know how to contact the authors?

  I've read articles on the format of .ARC files, but none were in-depth
enough for your purposes.  In a nutshell, however, ARC determines
which of four compression techniques to use on each file it's asked to
archive, compresses each file accordingly, and sticks the results
together.  The 8-bit version of ARC never uses the 4th technique
("crunching") due to memory restrictions, but the 8-bit ARCX can unARC
files compressed with all four techniques.
  I believe that the 8-bit ARC and ARCX were written in Lightspeed C
by the authors of that language, who frequent the GEnie Atari
Roundtable.  I can recall reading a message by one of the authors
explaining that it's pointless to attempt to improve on the speed of
the 8-bit ARC.  It's a very calculation-intensive application, and the
Atari is only so fast at these things.
  Still, if you're willing to tackle the task in assembler, you may be
able to speed it up a bit...

  On another note: I have discovered by much experimentation that ARC
1.2 tends to compress text files such that they contain an extra copy
of their last byte when unARCed.  This causes ARCX to generate
checksum errors, which can safely be ignored.  Regardless, I find it
safest to ARC things with ARC 1.1 and deARC them with ARCX 1.2.
-- 
_________________________________ Jersey    |||  _____________________________
ARPA: knutsen@rutgers.edu       |    Atari / | \ | GEnie GE Mail: M.KNUTSEN
UUCP: {...}!rutgers.edu!knutsen |  |||  Computer | The JACG BBS: (201)298-0161
--------------------------------- / | \    Group -----------------------------

hyc@umix.UUCP (09/30/87)

A long time ago I said I'd port ARC in Action!, but I never finished,
and my 800XL has since bitten the dust. However, I have full docs and source
code in C if you want them. I'm currently porting ARC 5.20 to my ST.
A description of the ARC header follows....
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

ARC-FILE.INF, created by Keith Petersen, W8SDZ, 21-Sep-86, extracted
from UNARC.INF by Robert A. Freed.

From:     Robert A. Freed
Subject:  Technical Information for ARC files
Date:     June 24, 1986

Note: In the following discussion, UNARC refers to my CP/M-80 program
for extracting files from MSDOS ARCs.  The definitions of the ARC file
format are based on MSDOS ARC512.EXE.

ARCHIVE FILE FORMAT
-------------------

Component files are stored sequentially within an archive.  Each entry
is preceded by a 29-byte header, which contains the directory
information.  There is no wasted space between entries.  (This is in
contrast to the centralized directory used by Novosielski libraries.
Although random access to subfiles within an archive can be noticeably
slower than with libraries, archives do have the advantage of not
requiring pre-allocation of directory space.)

Archive entries are normally maintained in sorted name order.  The
format of the 29-byte archive header is as follows:

Byte 1:  1A Hex.
         This marks the start of an archive header.  If this byte is not found 
         when expected, UNARC will scan forward in the file (up to 64K bytes) 
         in an attempt to find it (followed by a valid compression version).  
         If a valid header is found in this manner, a warning message is 
         issued and archive file processing continues.  Otherwise, the file is 
         assumed to be an invalid archive and processing is aborted.  (This is
         compatible with MS-DOS ARC version 5.12.)  Note that a special
         exception is made at the beginning of an archive file, to accommodate
         "self-unpacking" archives (see below).

Byte 2:  Compression version, as follows:

         0 = end of file marker (remaining bytes not present)
         1 = unpacked (obsolete)
         2 = unpacked
         3 = packed
         4 = squeezed (after packing)
         5 = crunched (obsolete)
         6 = crunched (after packing) (obsolete)
         7 = crunched (after packing, using faster hash algorithm) (obsolete)
         8 = crunched (after packing, using dynamic LZW variations)

Bytes 3-15:  ASCII file name, NUL-terminated (13 bytes, so at most 12
             characters plus the terminator).

(All of the following numeric values are stored low-byte first.)

Bytes 16-19:  Compressed file size in bytes.

Bytes 20-21:  File date, in 16-bit MS-DOS format:
              Bits 15:9 = year - 1980
              Bits  8:5 = month of year
              Bits  4:0 = day of month
              (All zero means no date.)

Bytes 22-23:  File time, in 16-bit MS-DOS format:
              Bits 15:11 = hour (24-hour clock)
              Bits 10:5  = minute
              Bits  4:0  = second/2 (not displayed by UNARC)

Bytes 24-25:  Cyclic redundancy check (CRC) value (see below).

Bytes 26-29:  Original (uncompressed) file length in bytes.
              (This field is not present for version 1 entries, byte 2 = 1.  
              I.e., in this case the header is only 25 bytes long.  Because 
              version 1 files are uncompressed, the value normally found in 
              this field may be obtained from bytes 16-19.)
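
The header layout above can be sketched in a few lines of Python.  This is
only an illustration of the byte offsets as described, not code from ARC or
UNARC; the names parse_header, decode_date, decode_time and METHODS are made
up for the example.

```python
import struct

HEADER_MARK = 0x1A  # byte 1 of every entry header

# Byte 2 values, as listed in the header description.
METHODS = {
    1: "unpacked (obsolete)",
    2: "unpacked",
    3: "packed",
    4: "squeezed",
    5: "crunched (obsolete)",
    6: "crunched after packing (obsolete)",
    7: "crunched after packing, faster hash (obsolete)",
    8: "crunched after packing, dynamic LZW",
}

def decode_date(word):
    """Split a 16-bit MS-DOS date into (year, month, day); None if all zero."""
    if word == 0:
        return None
    return ((word >> 9) + 1980, (word >> 5) & 0x0F, word & 0x1F)

def decode_time(word):
    """Split a 16-bit MS-DOS time into (hour, minute, second)."""
    return (word >> 11, (word >> 5) & 0x3F, (word & 0x1F) * 2)

def parse_header(buf, offset=0):
    """Parse the entry header at buf[offset]; None means end-of-file marker."""
    if buf[offset] != HEADER_MARK:
        raise ValueError("no 1A hex marker at offset %d" % offset)
    version = buf[offset + 1]
    if version == 0:
        return None  # remaining header bytes are not present
    # Bytes 3-15 (offsets 2..14): NUL-terminated ASCII name.
    raw_name = bytes(buf[offset + 2:offset + 15])
    name = raw_name.split(b"\0", 1)[0].decode("ascii")
    # Bytes 16-25: size (4), date (2), time (2), CRC (2), all low-byte first.
    size, date, time, crc = struct.unpack_from("<IHHH", buf, offset + 15)
    if version == 1:
        header_len, length = 25, size  # version 1 has no length field
    else:
        header_len = 29
        (length,) = struct.unpack_from("<I", buf, offset + 25)
    return {
        "version": version,
        "method": METHODS.get(version, "unknown"),
        "name": name,
        "size": size,
        "date": decode_date(date),
        "time": decode_time(time),
        "crc": crc,
        "length": length,
        "header_len": header_len,
    }
```

Since there is no wasted space between entries, the next header should begin
at offset + header_len + size.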


SELF-UNPACKING ARCHIVES
-----------------------

A "self-unpacking" archive is one which can be renamed to a .COM file
and executed as a program.  An example of such a file is the MS-DOS
program ARC512.COM, which is a standard archive file preceded by a
three-byte jump instruction.  The first entry in this file is a simple
"bootstrap" program in uncompressed form, which loads the subfile
ARC.EXE (also uncompressed) into memory and passes control to it.  In
anticipation of a similar scheme for future distribution of UNARC, the
program permits up to three bytes to precede the first header in an
archive file (with no error message).
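
The header search described above (forward scan for 1A hex, with silent
tolerance of up to three leading bytes at the start of the file) might look
roughly like this.  It is a sketch, not UNARC's actual logic: find_header is
a hypothetical name, treating versions 0 through 8 as "valid" is an
assumption based on the table of compression versions, and the exact bounds
of the 64K scan are a guess.

```python
HEADER_MARK = 0x1A
MAX_VERSION = 8            # assumption: highest version in the table
SCAN_LIMIT = 64 * 1024     # "up to 64K bytes"
SELF_UNPACK_SLACK = 3      # bytes allowed before the first header

def find_header(buf, pos=0):
    """Look for a header at or after pos.

    Returns (offset, warn), or None if nothing plausible turns up within
    SCAN_LIMIT bytes.  warn is True when bytes had to be skipped and the
    skip is not excused by the self-unpacking exception.
    """
    limit = min(len(buf) - 1, pos + SCAN_LIMIT)
    for off in range(pos, limit):
        # A marker only counts if a valid compression version follows it.
        if buf[off] == HEADER_MARK and buf[off + 1] <= MAX_VERSION:
            skipped = off - pos
            quiet = skipped == 0 or (pos == 0 and skipped <= SELF_UNPACK_SLACK)
            return off, not quiet
    return None
```

For example, an archive that starts with a three-byte jump instruction gives
offset 3 with no warning, while four stray bytes at the start give offset 4
with a warning.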


CRC COMPUTATION
---------------

Archive files use a 16-bit cyclic redundancy check (CRC) for error
control.  The particular CRC polynomial used is x^16 + x^15 + x^2 + 1,
which is commonly known as "CRC-16" and is used in many data
transmission protocols (e.g. DEC DDCMP and IBM BSC), as well as by
most floppy disk controllers.  Note that this differs from the CCITT
polynomial (x^16 + x^12 + x^5 + 1), which is used by the XMODEM-CRC
protocol and the public domain CHEK program (although these do not
adhere strictly to the CCITT standard).  The MS-DOS ARC program does
perform a mathematically sound and accurate CRC calculation.  (We
mention this because it contrasts with some unfortunately popular
public domain programs we have witnessed, which from time immemorial
have based their calculation on an obscure magazine article which
contained a typographical error!)
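
A bit-at-a-time version of a CRC using this polynomial is sketched below.
The text above gives only the polynomial; the low-bit-first processing order
(hence the reversed constant A001 hex), the zero starting value and the
absence of a final XOR are assumptions, taken from the parameter set
commonly catalogued as "CRC-16/ARC".  The name crc16_arc is made up for the
example.

```python
def crc16_arc(data, crc=0):
    """CRC-16 over x^16 + x^15 + x^2 + 1, processed low bit first.

    Pass the previous result back in as crc to checksum a file in pieces.
    """
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001  # 0x8005 with its bits reversed
            else:
                crc >>= 1
    return crc
```

With these parameters the usual test string b"123456789" yields BB3D hex,
and feeding the data in two pieces gives the same answer as one pass.  A
table-driven version would be much faster; this one is kept short for
clarity.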

Additional note (while we are on the subject of CRCs): The validity
of using a 16-bit CRC for checking an entire file is somewhat
questionable.  Many people quote the statistics related to these
functions (e.g. "all two-bit errors, all single burst errors of 16 or
fewer bits, 99.997% of all single 17-bit burst errors, etc."), without
realizing that these claims are valid only if the total number of bits
checked is less than 32767 (which is why they are used in small-packet
data transmission protocols).  I.e., for file sizes in excess of about
4K bytes, a 16-bit CRC is not really as good as what is often claimed.
This is not to say that it is bad, but there are more reliable methods
available (e.g. the 32-bit AUTODIN-II polynomial).  (End of lecture!)
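
The 32767-bit limit can be checked directly.  The polynomial factors as
(x + 1)(x^15 + x + 1), and the second factor has period 2^15 - 1 = 32767, so
two flipped bits exactly 32767 positions apart should slip through.  That
factorization is arithmetic added here, not something stated in the note
above, and the sketch below reuses the same assumed CRC parameters
(low bit first, zero start, no final XOR).  The names crc16_arc and flip_bit
are made up for the example.

```python
def crc16_arc(data, crc=0):
    """Bitwise CRC-16 (x^16 + x^15 + x^2 + 1), low bit first, zero start."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

PERIOD = 2 ** 15 - 1  # 32767 bits, just under 4K bytes

def flip_bit(buf, stream_pos):
    """Flip one bit, counting in the order the CRC consumes them."""
    buf[stream_pos // 8] ^= 1 << (stream_pos % 8)

original = bytearray(range(256)) * 16      # 4096 bytes of sample data

far_apart = bytearray(original)
flip_bit(far_apart, 0)
flip_bit(far_apart, PERIOD)                # second error 32767 bits later

close = bytearray(original)
flip_bit(close, 0)
flip_bit(close, 100)                       # second error 100 bits later

undetected = crc16_arc(far_apart) == crc16_arc(original)
detected = crc16_arc(close) != crc16_arc(original)
```

Here undetected comes out True even though the data differs in two places,
while the pair of errors only 100 bits apart is caught.  A file has to be at
least 4096 bytes long before such a pair can occur.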

                           Bob Freed
                           62 Miller Road
                           Newton Centre, MA  02159
                           Telephone (617) 332-3533


-- 
  -- Howard Chu
	"Of *course* it's portable. It's written in C, isn't it?"