ariel@prime.com (Rob Ullmann) (03/27/91)
The following is the present draft of a one-pass compression and encoding of an arbitrary (octet-aligned) binary object, to make it mailable. It is a combination of LZJU90, xxencode, and a CRC check. Please direct comments to both co-authors. Rob Ullmann Network Working Group R. Jung, R. Ullmann Request for Comments: DRAFT Prime Computer, Inc. January 1991 LZJU90: Compressed Encoding for Binary Mail 1. Status of this Memo This memo describes an encoding [1] for a binary object to be sent in an Internet mail message. The encoding provides both compression and representation in a text format that will successfully survive transmission through the many different mailers and gateways that comprise the Internet and connected mail networks. Distribution of this memo is unlimited. 2. Introduction The encoding first compresses the binary object, using a modified LZ77 algorithm, called LZJU90. It then encodes each 6 bits of the output of the compression as a text character, using a character set chosen to survive any translations between codes, such as ASCII to EBCDIC. The 64 six-bit strings 000000 through 111111 are represented by the characters "+", "-", "0" to "9", "A" to "Z", and "a" to "z". The output text begins with a line identifying the encoding. This is for visual reference only, the Encoding: field in the header identifies the section to the user program. It also names the object that was encoded, usually by a file name. The format of this line is: * LZJU90 <name> where <name> is optional. For example: * LZJU90 vmunix This is followed by the compressed and encoded data, broken into lines where convenient. It is recommended that lines be broken every 78 characters, to survive mailers than restrict line length. The decoder must accept lines with 1 to 1000 characters on each line. After this, there is one final line that gives the number of bytes in the original data and a CRC of the original data. This should match the byte count and CRC found during decompression. This line has the format: * <count> <CRC> Jung, Ullmann [Page 1] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 where <count> is a decimal number, and CRC is 8 hexadecimal digits. For example: * 4128076 5AC2D50E The count used in the Encoding: field in the message header is the total number of lines, including the start and end lines that begin with *. A complete example is given in section 6. 3. Specification of the LZJU90 compression This data compression specification uses the Lempel-Ziv-Storer-Szymanski model of mixing pointers and literal characters. The data compression is defined by the decoding algorithm. Any encoder that emits symbols which cause the decoder to produce the original input is defined to be valid. There are many possible strategies for the maximal-string matching that the encoder does, section 5 gives the code for one such algorithm. Regardless of which algorithm is used, and what tradeoffs are made between compression ratio and execution speed or space, the result can always be decoded by the simple decoder. The compressed data consists of a mixture of unencoded literal characters and copy pointers which point to an earlier occurrence of the string to be encoded. Compressed data contains two types of codewords: LITERAL pass the literal directly to the uncompressed output COPY length, offset go back offset characters in the output and copy length characters forward to the current position. To distinguish between codewords, the copy length is used. A copy length of zero indicates that the following codeword is a literal codeword. A copy length greater than zero indicates that the following codeword is a copy codeword. To improve copy length encoding, a threshold value of 2 has been subtracted from the original copy length for copy codewords, because the minimum copy length is 3 in this compression scheme. The maximum offset value is set at 32255. Larger offsets offer extremely low improvements in compression (less than 1 percent, typically). No special encoding is done on the LITERAL characters. However, unary Jung, Ullmann [Page 2] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 encoding is used for the copy length and copy offset values to improve compression. A start-step-stop unary code is used. A (start, step, stop) unary code of the integers is defined as follows: The Nth codeword has N ones followed by a zero followed by a field of size START + (N * STEP). If the field width is equal to STOP then the preceding zero can be omitted. The integers are laid out sequentially through these codewords. For example, (0, 1, 4) would look like: Codeword Range 0 0 10x 1-2 110xx 3-6 1110xxx 7-14 1111xxxx 15-30 Below are the actual values used for copy length and copy offset: The copy length is encoded with a (0, 1, 7) code leading to a maximum copy length of 256 by including the THRESHOLD value of 2. Codeword Range 0 0 10x 3-4 110xx 5-8 1110xxx 9-16 11110xxxx 17-32 111110xxxxx 33-64 1111110xxxxxx 65-128 1111111xxxxxxx 129-256 The copy offset is encoded with a (9, 1, 14) code leading to a maximum copy offset of 32255. Offset 0 is reserved as an end of compressed data flag. Codeword Range 0xxxxxxxxx 0-511 10xxxxxxxxxx 512-1535 110xxxxxxxxxxx 1536-3583 1110xxxxxxxxxxxx 3485-7679 11110xxxxxxxxxxxxx 7680-15871 11111xxxxxxxxxxxxxx 15872-32255 Jung, Ullmann [Page 3] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 The 0 has been chosen to signal the start of the field for ease of encoding. The stop values are useful in the encoding to prevent out of range values for the lengths and offsets, as well as shortening some codes by one bit. The worst case compression using this scheme is a 1/8 increase in size of the encoded data. (One zero bit followed by 8 character bits). After the character encoding, the worst case ratio is 3/2 to the original data. The minimum copy length of 3 has been chosen because the worst case copy length and offset is 3 bits (3) and 19 bits (32255) for a total of 22 bits to encode a 3 character string (24 bits). Jung, Ullmann [Page 4] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 4. The Decoder As mentioned previously, the compression is defined by the decoder. Any encoder that produced output that is correctly decoded is by definition correct. The following is an implementation of the decoder, written more for clarity and as much portability as possible, rather than for maximum speed. When optimized for a specific environment, it will run significantly faster. /* LZJU 90 Decoding program */ #include <stdio.h> typedef unsigned char uchar; typedef unsigned int uint; #define N 32255 #define THRESHOLD 3 #define STRTP 9 #define STEPP 1 #define STOPP 14 #define STRTL 0 #define STEPL 1 #define STOPL 7 static FILE *in; static FILE *out; static int getbuf; static int getlen; static long in_count; static long out_count; static long crc; static long crctable[256]; static uchar xxcodes[] = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\ abcdefghijklmnopqrstuvwxyz"; static uchar ddcodes[256]; static uchar text[N]; Jung, Ullmann [Page 5] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 #define CRCPOLY 0xEDB88320 #define CRC_MASK 0xFFFFFFFF #define UPDATE_CRC(crc, c) \ crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \ ^ (crc >> 8) #define START_RECD "* LZJU90" void MakeCrctable() /* Initialize CRC-32 table */ { uint i, j; long r; for (i = 0; i <= 255; i++) { r = i; for (j = 8; j > 0; j--) { if (r & 1) r = (r >> 1) ^ CRCPOLY; else r >>= 1; } crctable[i] = r; } } int GetXX() /* Get xxcode and translate */ { int c; do { if ((c = fgetc(in)) == EOF) c = 0; } while (c == '\n'); in_count++; return ddcodes[c]; } int GetBit() /* Get one bit from input buffer */ { int c; while (getlen <= 0) { c = GetXX(); getbuf |= c << (10-getlen); getlen += 6; } c = (getbuf & 0x8000) != 0; getbuf <<= 1; getbuf &= 0xFFFF; getlen--; return(c); } Jung, Ullmann [Page 6] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 int GetBits(len) /* Get len bits */ int len; { int c; while (getlen <= 10) { c = GetXX(); getbuf |= c << (10-getlen); getlen += 6; } if (getlen < len) { c = (uint)getbuf >> (16-len); getbuf = GetXX(); c |= getbuf >> (6+getlen-len); getbuf <<= (10+len-getlen); getbuf &= 0xFFFF; getlen -= len - 6; } else { c = (uint)getbuf >> (16-len); getbuf <<= len; getbuf &= 0xFFFF; getlen -= len; } return(c); } int DecodePosition() /* Decode offset position pointer */ { int c; int width; int plus; int pwr; plus = 0; pwr = 1 << STRTP; for (width = STRTP; width < STOPP; width += STEPP) { c = GetBit(); if (c == 0) break; plus += pwr; pwr <<= 1; } if (width != 0) c = GetBits(width); c += plus; return(c); } Jung, Ullmann [Page 7] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 int DecodeLength() /* Decode code length */ { int c; int width; int plus; int pwr; plus = 0; pwr = 1 << STRTL; for (width = STRTL; width < STOPL; width += STEPL) { c = GetBit(); if (c == 0) break; plus += pwr; pwr <<= 1; } if (width != 0) c = GetBits(width); c += plus; return(c); } void InitCodes() /* Initialize decode table */ { int i; for (i = 0; i < 256; i++) ddcodes[i] = 0; for (i = 0; i < 64; i++) ddcodes[xxcodes[i]] = i; return; } main(ac, av) /* main program */ int ac; char **av; { int r; int j, k; int c; int pos; char buf[80]; char name[3]; long num, bytes; if (ac < 3) { fprintf(stderr, "usage: judecode in out\n"); exit(1); } in = fopen(av[1], "r"); if (!in){ fprintf(stderr, "Can't open %s\n", av[1]); exit(1); } Jung, Ullmann [Page 8] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 out = fopen(av[2], "w"); if (!out) { fprintf(stderr, "Can't open %s\n", av[2]); fclose(in); exit(1); } while (1) { if (fgets(buf, sizeof(buf), in) == NULL) { fprintf(stderr, "Unexpected EOF\n"); exit(1); } if (strncmp(buf, START_RECD, strlen(START_RECD)) == 0) break; } in_count = 0; out_count = 0; getbuf = 0; getlen = 0; InitCodes(); MakeCrctable(); crc = CRC_MASK; r = 0; while (feof(in) == 0) { c = DecodeLength(); if (c == 0) { c = GetBits(8); UPDATE_CRC(crc, c); out_count++; text[r] = c; fputc(c, out); if (++r >= N) r = 0; } else { pos = DecodePosition(); if (pos == 0) break; pos--; j = c + THRESHOLD - 1; pos = r - pos - 1; if (pos < 0) pos += N; Jung, Ullmann [Page 9] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 for (k = 0; k < j; k++) { c = text[pos]; text[r] = c; UPDATE_CRC(crc, c); out_count++; fputc(c, out); if (++r >= N) r = 0; if (++pos >= N) pos = 0; } } } fgetc(in); /* skip newline */ if (fscanf(in, "* %ld %lX", &bytes, &num) != 2) { fprintf(stderr, "CRC record not found\n"); exit(1); } else if (crc != num) { fprintf(stderr, "CRC error, expected %lX, found %lX\n", crc, num); exit(1); } else if (bytes != out_count) { fprintf(stderr, "File size error, expected %lu, found %lu\n", bytes, out_count); exit(1); } else fprintf(stderr, "File decoded to %lu bytes correctly\n", out_count); fclose(in); fclose(out); return; } Jung, Ullmann [Page 10] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 5. An example of an Encoder Many algorithms are possible for the encoder, with different tradeoffs between speed, size, and complexity. The following is a simple example program which is fairly efficient; more sophisticated implementations will run much faster, and in some cases produce somewhat better compression. This example also shows that the encoder need not use the entire window available. Not using the full window costs a small amount of compression, but can greatly increase the speed of some algorithms. /* LZJU 90 Encoding program */ #include <stdio.h> typedef unsigned char uchar; typedef unsigned int uint; #define N 8192 /* Size of window buffer */ #define F 256 /* Size of look-ahead buffer */ #define THRESHOLD 3 #define NIL N /* End of tree's node */ #define STRTP 9 #define STEPP 1 #define STOPP 14 #define STRTL 0 #define STEPL 1 #define STOPL 7 #define CHARSLINE 78 static FILE *in; static FILE *out; static int putlen; static int putbuf; static int char_ct; static long in_count; static long out_count; static long crc; static long crctable[256]; static uchar xxcodes[] = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\ abcdefghijklmnopqrstuvwxyz"; Jung, Ullmann [Page 11] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 static uchar text[N + F - 1]; static int match_position; static int match_length; static int lson[N + 1]; static int rson[N + 257]; static int dad[N + 1]; #define CRCPOLY 0xEDB88320 #define CRC_MASK 0xFFFFFFFF #define UPDATE_CRC(crc, c) \ crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \ ^ (crc >> 8) void MakeCrctable() /* Initialize CRC-32 table */ { uint i, j; long r; for (i = 0; i <= 255; i++) { r = i; for (j = 8; j > 0; j--) { if (r & 1) r = (r >> 1) ^ CRCPOLY; else r >>= 1; } crctable[i] = r; } } void PutXX(c) /* Translate and put xxcode */ int c; { c = xxcodes[c & 0x3F]; if (++char_ct > CHARSLINE) { char_ct = 1; fputc('\n', out); } fputc(c, out); out_count++; } Jung, Ullmann [Page 12] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 void PutBits(c, len) /* Put rightmost "len" bits of "c" */ int c, len; { c <<= 16 - len; c &= 0xFFFF; putbuf |= (uint) c >> putlen; c <<= 16 - putlen; c &= 0xFFFF; putlen += len; while (putlen >= 6) { PutXX(putbuf >> 10); putlen -= 6; putbuf <<= 6; putbuf &= 0xFFFF; putbuf |= (uint) c >> 10; c = 0; } } void EncodePosition(ch) /* Encode offset position pointer */ int ch; { int width; int prefix; int pwr; pwr = 1 << STRTP; for (width = STRTP; ch >= pwr; width += STEPP, pwr <<= 1) ch -= pwr; if ((prefix = width - STRTP) != 0) PutBits(0xffff, prefix); if (width < STOPP) width++; else if (width > STOPP) abort(); PutBits(ch, width); } Jung, Ullmann [Page 13] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 void EncodeLength(ch) /* Encode code length */ int ch; { int width; int prefix; int pwr; pwr = 1 << STRTL; for (width = STRTL; ch >= pwr; width += STEPL, pwr <<= 1) ch -= pwr; if ((prefix = width - STRTL) != 0) PutBits(0xffff, prefix); if (width < STOPL) width++; else if (width > STOPL) abort(); PutBits(ch, width); } void InitTree() { int i; for (i = N + 1; i <= N + 256; i++) rson[i] = NIL; for (i = 0; i < N; i++) dad[i] = NIL; } Jung, Ullmann [Page 14] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 /* Insert string of length F, text[r..r+F-1], into one of the trees (text[r]'th tree) and return the longest-match position and length via the global variables match_position and match_length. If match_length = F, then remove the old node in favor of the new one, because the old one will be deleted sooner. Note r plays double role, as tree node and position in buffer. */ void InsertNode(r) int r; { int i, cmp, p, c; uchar *key, *keyp, *txtp; cmp = 1; key = &text[r]; p = N + 1 + key[0]; rson[r] = lson[r] = NIL; match_length = 0; for ( ; ; ) { if (cmp >= 0) { if (rson[p] != NIL) p = rson[p]; else { rson[p] = r; dad[r] = p; return; } } else { if (lson[p] != NIL) p = lson[p]; else { lson[p] = r; dad[r] = p; return; } } txtp = &text[p]; keyp = key; if ((cmp = *++keyp - *++txtp) != 0) continue; for (i = 2; i < F; i++) if ((cmp = *++keyp - *++txtp) != 0) break; Jung, Ullmann [Page 15] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 if (i > match_length) { match_position = ((r - p) & (N - 1)); if ((match_length = i) >= F) break; } else if (i == match_length) { if ((c = ((r - p) & (N - 1))) < match_position) match_position = c; } } dad[r] = dad[p]; lson[r] = lson[p]; rson[r] = rson[p]; dad[lson[p]] = r; dad[rson[p]] = r; if (rson[dad[p]] == p) rson[dad[p]] = r; else lson[dad[p]] = r; dad[p] = NIL; } Jung, Ullmann [Page 16] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 void DeleteNode(p) /* Delete node p from tree */ int p; { int q; if (dad[p] == NIL) return; if (rson[p] == NIL) q = lson[p]; else if (lson[p] == NIL) q = rson[p]; else { q = lson[p]; if (rson[q] != NIL) { do { q = rson[q]; } while (rson[q] != NIL); rson[dad[q]] = lson[q]; dad[lson[q]] = dad[q]; lson[q] = lson[p]; dad[lson[p]] = q; } rson[q] = rson[p]; dad[rson[p]] = q; } dad[q] = dad[p]; if (rson[dad[p]] == p) rson[dad[p]] = q; else lson[dad[p]] = q; dad[p] = NIL; } main(ac, av) /* main program */ int ac; char **av; { int r, s, i, c; int last_match_length; int len; if (ac < 3) { fprintf(stderr, "usage: juencode in out\n"); exit(1); } in = fopen(av[1], "r"); if (!in) { fprintf(stderr, "Can't open %s\n", av[1]); exit(1); } Jung, Ullmann [Page 17] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 out = fopen(av[2], "w"); if (!out) { fprintf(stderr, "Can't open %s\n", av[2]); fclose(in); exit(1); } char_ct = 0; in_count = 0; out_count = 0; putbuf = 0; putlen = 0; MakeCrctable(); crc = CRC_MASK; fprintf(out, "* LZJU90 %s\n", av[1]); InitTree(); r = 0; s = 0; /* Fill lookahead buffer */ for (len = 0; len < F && (c = fgetc(in)) != EOF; len++) { UPDATE_CRC(crc, c); in_count++; text[s++] = c; } while (len > 0) { InsertNode(r); if (match_length > len) match_length = len; if (match_length < THRESHOLD) { EncodeLength(0); PutBits(text[r], 8); match_length = 1; } else { EncodeLength(match_length - THRESHOLD + 1); EncodePosition(match_position); } Jung, Ullmann [Page 18] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 last_match_length = match_length; for (i = 0; i < last_match_length && (c = fgetc(in)) != EOF; i++) { UPDATE_CRC(crc, c); in_count++; DeleteNode(s); text[s] = c; if (s < F - 1) text[s + N] = c; s = (s + 1) & (N - 1); } while (i++ < last_match_length) { DeleteNode(s); s = (s + 1) & (N - 1); len--; } r = (r + last_match_length) & (N - 1); } /* end compression indicator */ EncodeLength(1); EncodePosition(0); PutBits(0, 7); fprintf(out, "\n* %lu %08lX\n", in_count, crc); fprintf(stderr, "Encoded %lu bytes to %lu symbols\n", in_count, out_count); fclose(in); fclose(out); } 6. LZJU90 example The following is an example of an LZJU90 compressed object. Using this as source for the program in section 4 will reveal what it is. * LZJU90 example 8-mBtWA7WBVZ3dEBtnCNdU2WkE4owW+l4kkaApW+o4Ir0k33Ao4IE4kk bYtk1XY618NnCQl+OHQ61d+J8FZBVVCVdClZ2-LUI0v+I4EraItasHbG VVg7c8tdk2lCBtr3U86FZANVCdnAcUCNcAcbCMUCdicx0+u4wEETHcRM 7tZ2-6Btr268-Eh3cUAlmBth2-IUo3As42laIE2Ao4Yq4G-cHHT-wCEU 6tjBtnAci-I++ * 190 081E2601 References [1] David Robinson, Robert L. Ullmann. Encoding Header Field for Internet Messages. RFC 1154, Prime Computer, April, 1990. Jung, Ullmann [Page 19] RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991 Author's Address Robert Jung 2606 Village Road West Norwood, MA 02062 USA Phone: +1 617 769 5999 Email: robjung@world.std.com Robert Ullmann 10-30 Prime Computer, Inc. 500 Old Connecticut Path Framingham, MA 01701 USA Phone: +1 508 620 2800 x1736 Email: Ariel@Relay.Prime.COM Jung, Ullmann [Page 20]
madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)
In response to Rob Ullmann's LZJU90 RFC draft which suggests a combined compression, xxencode, and crc checked format: In my humble opinion, the compression utility and the "expansion" utility used to make the binary result of the compression mailable should be separate utilities. The reason is that compression is a rapidly advancing field, and better methods will continue to arise and supplant old ones, whereas making a binary mailable is well understood, and very nearly optimal base 85 or base 95 (the former handles EBCDIC, the latter being ASCII only) methods can be used. xxencode, by the way, is not a nearly optimal method. Mark Adler madler@pooh.caltech.edu
brad@looking.on.ca (Brad Templeton) (03/27/91)
I also agree that this makes little sense, unless you are running on DOS and the software tools approach is more difficult. Keep the compressor and the asciifyer independent. You want to be able to switch compressors at will, or not compress at all. You also might want to switch ascii convertors at will. ABE was designed to do just about everything you want in an ASCII encoder. It can take inut from a pipe, such as your compression program. Why write two programs when "compressor <args> | abe <args>" will do it as a one-line shell script. Among ABE's features by the way are: Choice of ABE1 (94 characters) or ABE2 (86 characters) or UUENCODE encoding. Automatic blocking, with each block directed to a pipe (like "|mail user") if desired. Files are often smaller, and compress well. Most printable characters map to themselves, so strings in binaries are readable right in the encoding. All lines are indexed, so sort(1) can repair any random scrambling of lines or files. (This can be turned off.) Extraneous lines (news headers, comments, signatures etc.) are ignored, even in the middle of encodings. A PD tiny decoder is available to include with files for first time users. Files can be split up automatically into equal sized blocks. Blocks can contain redundant information so that the decoder can handle blocks in any order, even with reposted duplicates and extraneous articles. Files with blank regions can be constructed from multi-part encodings with damaged blocks. Multiple files can be placed in one encoding. The decoder is extremely general and configurable, and supports many features not currently found in the encoder, but which other encoder writers might fight useful. For example, if a multi part abe encoding was sent to you in random order, you can often just say "dabe <mailbox>" and get the files out. Abe was posted to comp.sources.misc some time ago, it's also on UUNET in the 'clarinet' ftp directory, I think. It's free. BTW ABE uses 86 characters in the ABE2 encoding format, which is EBCDIC proof. The characters to avoid are: ! ` [ \ ] ^ { | } ~ Plus spaces, newlines, DEL, and all control characters, of course. -- Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473
madler@nntp-server.caltech.edu (Mark Adler) (03/28/91)
Brad Templeton writes: >> BTW ABE uses 86 characters in the ABE2 encoding format, which is >> EBCDIC proof. The characters to avoid are: >> >> ! ` [ \ ] ^ { | } ~ But this only leaves 84 characters. Perhaps you didn't mean to include the braces ('{' and '}')--I believe those are EBCDIC-safe. That would give you 86. Mark Adler madler@pooh.caltech.edu