[net.micro] Encodings for Binary Files

SY.FDC@CU20B.COLUMBIA.EDU (Frank da Cruz) (10/24/86)
There already is a fairly widespread efficient method of encoding binary
files suitable for sending through electronic mail, for "punching" through
card-oriented networks (like BITNET), and for putting on transportable tape
formats like ANSI or OS SL -- the (in)famous Kermit "BOO file".  Like
UUENCODE, BOO is a 4-for-3 encoding and requires some bit manipulation
(splitting three 8-bit bytes up into four 6-bit bytes).  The major
difference between BOO and UUENCODE is the offset added to the 6-bit byte to
make it printable: UUENCODE uses 32 decimal (so that 0 comes out as a space)
and BOO uses 48, so that 0 comes out as "0".  The major consequence is that
UUENCODE files contain spaces and periods whereas BOO files do not, and therefore
BOO files are more transparent (for instance, BITNET will trim trailing
spaces; some mail software will treat lines starting with dots as commands).
(By the way, neither UUENCODE nor BOO files contain curly braces.)  In
addition, BOO files compress runs of consecutive 0 bytes, long strings of
which are often found in .EXE files.  For instance, here is how various
encodings of the MS-DOS program FIND.EXE come out:

  Encoding   File       Size (bytes)
  Original   FIND.EXE    5796
  HEX        FIND.HEX   11888
  UUENCODE   FIND.UUE    8006
  BOO        FIND.BOO    1505
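
To make the offset difference concrete, here is a minimal Python sketch of
the bare 4-for-3 step.  It is only an illustration, not the MSBMKB code: the
function name encode_4_for_3 is made up for this example, and the sketch
omits UUENCODE's per-line length character and begin/end lines as well as
BOO's compression of 0 bytes.

```python
def encode_4_for_3(data: bytes, offset: int) -> str:
    """Encode bytes as printable characters, four per three input bytes.

    offset is added to each 6-bit value: 32 gives the UUENCODE-style
    alphabet, 48 gives the BOO-style alphabet.
    """
    out = []
    for i in range(0, len(data), 3):
        # Zero-pad a short final group so it still yields four characters.
        a, b, c = data[i:i + 3].ljust(3, b"\0")
        sextets = (
            a >> 2,
            ((a & 0x03) << 4) | (b >> 4),
            ((b & 0x0F) << 2) | (c >> 6),
            c & 0x3F,
        )
        out.extend(chr(s + offset) for s in sextets)
    return "".join(out)


sample = bytes([0x00, 0x4D, 0x5A])
uu_style = encode_4_for_3(sample, 32)    # ' $U:'  (begins with a space)
boo_style = encode_4_for_3(sample, 48)   # '04eJ'
```

With offset 48 every output character falls between "0" (48) and "o" (111),
so no space or period can appear; with offset 32 a 6-bit value of 0 becomes
a space.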

BOO file makers and decoders are available in the Kermit distribution as
MSB*.*.  The programs are fairly short -- the C-language BOO file maker,
MSBMKB.C, has about 140 lines of code.  The decoder (MSBPCT.*) comes in
various languages: MS Basic (about 65 lines), C (130 lines), Pascal (155),
and MS Assembler (235).  You can type in the (slow) Basic version, then
download a BOO file of one of the compiled versions, un-boo it with the
Basic version, and then use the faster compiled version henceforth.

This is not, however, an unvarnished plug for BOO files as a standard.
BOO files have several distinct shortcomings:

1. The filename (in plain text) precedes the file, but it MUST be on the
first line and is not distinctively marked, as it is in UUENCODE format
(BOO files say "MSKERMIT.EXE", whereas UUE files say "begin 777
MSKERMIT.EXE").  This means that there's no good way to find the "file
header" of a BOO file when it's preceded by junk (like mail headers).  It
also means there's no good way to concatenate more than one file together
into a single-file archive.

2. There's no error checking built in -- no length fields, no checksums or
CRCs.  This was skipped largely because of ASCII/EBCDIC considerations.
There should be an error-checking mechanism that is independent of the
character set (it's not obvious to me what this could be).

3. 4-for-3 encoding may be efficient for binary files, but it's
wasteful for text files, as Bill Silvert pointed out in his message.
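
To illustrate point 1, here is a rough Python sketch of how a distinctively
marked header can be located after leading junk such as mail headers.  The
pattern assumes the usual UUENCODE header form "begin <mode> <name>"; the
function name find_uu_header and the sample message are invented for this
example.

```python
import re

# UUENCODE-style header: the word "begin", an octal mode, then the filename.
BEGIN_LINE = re.compile(r"^begin ([0-7]{3,4}) (\S.*)$")


def find_uu_header(lines):
    """Return (line_index, mode, filename) for the first header line, or None."""
    for index, line in enumerate(lines):
        match = BEGIN_LINE.match(line.rstrip("\r\n"))
        if match:
            return index, match.group(1), match.group(2)
    return None


message = [
    "From: someone@example.edu",
    "Subject: MS-DOS Kermit",
    "",
    "begin 777 MSKERMIT.EXE",
    "(encoded lines would follow here)",
]
found = find_uu_header(message)   # (3, '777', 'MSKERMIT.EXE')
```

A bare first line such as "MSKERMIT.EXE" offers no comparable pattern to
search for, which is why the same scan cannot be done reliably on a BOO file.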

I agree it's time for some kind of standard for encoding files for
transmission on or through "hostile" media like electronic mail, BITNET,
ANSI tape, IBM mainframes, etc.  Some goals might be:

. Transparency: no 8-bit or control characters, no spaces, {|}, etc.,
  no long lines (probably 72 is a reasonable maximum length).
. System independence: unlike Intel Hex files or MacBinary files, the
  encoding should not depend on or reflect any particular architecture
  or file system.
. Error detection (length fields and checksums and/or CRCs).
. Character set independence: use only non-troublesome printable characters.
. Compactness of encoding to minimize storage and transmission time.
. Ability to concatenate files, possibly of different types, and decode
  automatically into separate files of the appropriate types.
. Simplicity of decoding algorithm, so the decoding program can be
  typed in (or translated to another language) by a user who doesn't have it.
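
As a rough indication of how small a decoder for the last goal can be, here
is a Python sketch of the inverse of a plain 4-for-3 encoding.  It is not the
MSBPCT decoder: the name decode_4_for_3 is invented, the default offset of 48
follows the BOO convention, and the optional length argument stands in for
the kind of length field suggested in the goals (without one, any padding in
the final group cannot be told apart from real data).

```python
def decode_4_for_3(text: str, offset: int = 48, length=None) -> bytes:
    """Turn each group of four printable characters back into three bytes."""
    if len(text) % 4:
        raise ValueError("encoded text must be a multiple of 4 characters")
    out = bytearray()
    for i in range(0, len(text), 4):
        s = [ord(ch) - offset for ch in text[i:i + 4]]
        if any(v < 0 or v > 63 for v in s):
            raise ValueError("character outside the 6-bit range")
        out.append((s[0] << 2) | (s[1] >> 4))
        out.append(((s[1] & 0x0F) << 4) | (s[2] >> 2))
        out.append(((s[2] & 0x03) << 6) | s[3])
    # Trim padding only when the caller knows the true length.
    return bytes(out if length is None else out[:length])


decoded = decode_4_for_3("04eJ")   # bytes 0x00 0x4D 0x5A
```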

For simplicity, the encoded files must be stream format -- structured files
(like Macintosh applications, record-oriented VAX/VMS or VM/CMS files, etc.)
will require a second level (e.g. MacBinary) or different method (e.g. BinHex)
of encoding.

BOO files and Bill's proposed encoding both achieve transparency in most
situations.  Error detection is the tricky area.  But if you base the
block check on the 8-bit binary values rather than the encoded printable
values, character set independence will be achieved so long as the
translation is OK.  I think I'd vote for a 16-bit checksum
rather than a CRC, since 16 bits can hold the sum of 72 characters
each having the highest printable ASCII value (tilde = ASCII 126,
and 126x72 = 9072), and the checksum is much simpler to program in most
languages than the CRC, not to mention more understandable.  A 16-bit
checksum can be expressed in three 6-bit characters (encoded printably).
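
The arithmetic can be checked with a small Python sketch.  This is only one
possible layout, not an agreed format: the name checksum_chars, the
most-significant-first ordering, and the offset of 48 are assumptions made
for the example.  The number of 6-bit characters is a parameter, and the sum
is reduced modulo 64**nchars.

```python
def checksum_chars(data: bytes, nchars: int, offset: int = 48) -> str:
    """Sum the 8-bit values and emit the sum as nchars printable characters."""
    total = sum(data) % (64 ** nchars)
    chars = []
    # Emit 6 bits at a time, most significant group first.
    for shift in range(6 * (nchars - 1), -1, -6):
        chars.append(chr(((total >> shift) & 0x3F) + offset))
    return "".join(chars)


worst_case_line = b"~" * 72          # 72 tildes: 126 x 72 = 9072
three = checksum_chars(worst_case_line, 3)
two = checksum_chars(worst_case_line, 2)
```

Three characters carry 18 bits, so the worst-case sum of 9072 is stored
without wrapping; two characters carry only 12 bits, so the same sum wraps
modulo 4096 (to 880).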

BOO files are compact for binary data, and wasteful for text data.  Bill's
idea of a uniform encoding for text and binary files is appealing and
deserves attention.  Alternatively, the "file header" can include an
indication of the encoding method used, e.g. "BOO" or "TEXT", and encoding
of text files can be skipped (this is not quite true -- there should be a
"canonic form" for text files, namely 7-bit ASCII, with records (lines)
terminated by CRLF; text files that did not fit this pattern should be
converted.)  Either encoding method would allow a mixture of text and binary
files in the same archive.  Sound familiar, sort of like ARC, or SQ/LIBR,
etc.?  The difference comes in the final goal -- ARC, LIBR, and friends (not
to mention Huffman and LZW decompressors) are not short, simple programs
that anyone can type in if they don't have them.  And of course, ARC files
(and SQ'd, LIBR'd, etc., files) are themselves binary.  We need a simple
decoding algorithm that can be expressed in 50-100 lines (or fewer) of BASIC,
C, or Pascal code, or the idea will never fly.
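
For the "canonic form" of text files mentioned above (7-bit ASCII, lines
terminated by CRLF), the conversion might look like the following Python
sketch.  The name to_canonical_text is invented, and treating a bare CR or a
bare LF as a line terminator is an assumption about what the input systems
produce.

```python
def to_canonical_text(raw: bytes) -> bytes:
    """Return raw with every line terminated by CRLF; reject 8-bit input."""
    if any(byte > 0x7F for byte in raw):
        raise ValueError("not 7-bit ASCII; treat the file as binary instead")
    # Normalize CRLF and bare CR to LF first, then split into lines.
    lines = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n").split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()                  # input already ended with a terminator
    return b"".join(line + b"\r\n" for line in lines)


canonical = to_canonical_text(b"a\nb\r\nc\rd")   # b'a\r\nb\r\nc\r\nd\r\n'
```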

Further comments and suggestions welcome.  Speaking as one who has to
distribute Kermit programs for hundreds of different systems, with different
encoding methods adopted for each, along with a host of encoding and
decoding programs, I would certainly like to see a uniform method for
encoding binary files for distribution emerge and achieve wide acceptance.

- Frank da Cruz, Columbia University