[comp.compression] COMPRESSING of binary data into mailable ASCII

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (03/26/91)

 d88-jwa@byse.nada.kth.se (Jon W{tte) writes:

> The previously suggested way of using arithmetic coding for
> uuencode-style data encoding into printable characters seems cool, but
> a simple "base 85" mapping from four bytes to five chars (such as used
> by atob) and vice versa does the trick with _very_ close to optimal
> performance. The problem is; it requires longword modulo and/or
> division, which has to be simulated on weaker processors (such as
> 68000 or (yukk !) 8086)

Was that mine?  It seems like  months since I remember putting that claim up.
It had a couple of advantages:

  o arithmetic encoding produces its output data bitwise anyway, so it
    is almost no extra trouble to capture it in groups of 13 bits instead
    of 8 or 16 bits.

  o 2^13 is just a smidgen less than 95*95, which makes close to optimal
    use of two bytes to encode 13 bits; let's see:

    13   4     65   1      65   64
    -- : -  :: -- : -  ::  -- : --
    16   5     64   1      80   80

    so the encoding of 13 bits into 16 is just a smidgen more efficient
    than the btoa 32 bits into 40 bits encoding, if I did that right, but
    is a bit more work _except_ in the case where arithmetic encoding or
    other bit-by-bit compression output is already being done anyway.

I'd love to see someone (me if I were in my right mind) take lharc and
replace the Huffman coding step by an arithmetic encoding step; if it
were at least as efficient an overall compression method (which seems
gut-level likely, since Huffman coding loses efficiency compared to
arithmetic compression by having to output an even number of bits per
compressed input symbol), the nice bit by bit output would quickly
seduce one into forming a command line flag option to make the output
mailable ASCII, instead of the default binary, which would then replace
the current "compress then encode for mailing" paradigm by a single
step, integrated solution.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>
--
Does btoa really only use base _eighty_-five? Doesn't look like it could
and still be that close to my a.e. solution in efficiency. A typo,
perhaps?

d88-jwa@byse.nada.kth.se (Jon W{tte) (03/26/91)

In article <> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

   Was that mine?  It seems like  months since I remember putting that
   claim up.

Nope. It was recent. I talked to the author via mail, but wanted to
point this out to the group.

   Does btoa really only use base _eighty_-five? Doesn't look like it could
   and still be that close to my a.e. solution in efficiency. A typo,
   perhaps?

No. 85^5 > 256^4 (try it yourself !) This results in an axpansion of
20%, just a little bit more than the theoretical base-9[56] coder.
That's logarithms at work for you ! :-)

The nice thing about base 85 is that you can choose the character set
to exclude certain ill-behaving characters (notably tilde (in cu) and
spce (as in the BSD tty driver :-))

							h+@nada.kth.se
							Jon W{tte
--
"The IM-IV file manager chapter documents zillions of calls, all of which
seem to do almost the same thing and none of which seem to do what I want
them to do."  --  Juri Munkki in comp.sys.mac.programmer

tgl@g.gp.cs.cmu.edu (Tom Lane) (03/26/91)

There's a vital point that several posters in this thread seem to have
missed.  If you want a representation that is safe to mail around the world,
*you can't use all ninety-five ASCII symbols*.  Some mailers translate mail
into other character sets (can you say EBCDIC?).  Some of the less common
ASCII symbols do not appear in EBCDIC, and unfortunately there isn't a
standardized translation.

The authors of btoa did the necessary research and concluded that there are
eighty-five ASCII symbols that are safe to mail.  So that's why btoa uses
base 85.  Any practical new scheme for mailable representations is not going
to be much better.

Incidentally, by my calculation, 85 symbols give you log2(85) = 6.41 bits
per symbol, where 95 give you 6.57.  So we are only talking a three percent
difference in size anyway.  (I have no idea whether btoa reaches the full
6.41 bits per symbol.  More likely they translate four source bytes into
five output symbols, or 6.4 bits/symbol.)

-- 
			tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)

I have a program called "ship" (written in C) that works on a wide
range of Unix machines, as well as MSDOS, that uses base 85 coding
as well as automatically splits mail into pieces if requested, mails
them if requested, reassembles them at the other end, and also does
a four byte CRC check on the result.  I can send it to whoever is
interested.

Mark Adler
madler@pooh.caltech.edu

madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)

By the way, my ship program uses EBCDIC-safe characters and no spaces.

Mark Adler
madler@pooh.caltech.edu

tgl@g.gp.cs.cmu.edu (Tom Lane) (03/27/91)

In article <12485@pt.cs.cmu.edu>, I said:
> The authors of btoa did the necessary research and concluded that there are
> eighty-five ASCII symbols that are safe to mail.  So that's why btoa uses
> base 85.

After looking at the recently posted source code, it's clear that btoa uses
base 85 because that's what you need for 6.4 bits/symbol (four input bytes
map to five output characters).  2^6.4 = 84.45, so base 84 is too small.

They avoid the ASCII characters above 'z' (codes 123-127), plus they leave
five lowercase letters unused.  (Actually they use 'x' and 'z' for special
purposes: 'x' is the EOF mark and 'z' represents four zero input bytes.)

*If* the btoa authors are right about which ASCII characters are safe to
mail (and I am no longer willing to take that as certain) then you could
theoretically go up to base 90 in a mail-safe representation.  But that gets
you log2(90) = 6.49 bits/symbol which is hard to use efficiently, and it's
only 1% less space than btoa's 6.4 bits/symbol.  (In some cases, btoa's use
of 'z' to represent four zeros would buy a lot more than 1%.)

I believe btoa is pretty close to the optimal mailable representation of
arbitrary binary data.  You can't possibly do more than about 2.4% better
(using 94 ASCII characters), and given practical considerations such as
unsafe characters and conversion speed, btoa looks pretty good.

However, it'd be nice to hear from someone who uses non-ASCII mailers about
just which characters are safe.  (Anybody have an EBCDIC character chart
handy?)  One thing that might be good to avoid is putting '.' at the start
of a line, which btoa will do quite happily.

-- 
			tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)

In article <12489@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
>However, it'd be nice to hear from someone who uses non-ASCII mailers about
>just which characters are safe.  (Anybody have an EBCDIC character chart
>handy?)  One thing that might be good to avoid is putting '.' at the start
>of a line, which btoa will do quite happily.

The set of 85 characters that I use in "ship" was originally selected
for a full screen file transfer program on IBM 3270 terminals emulated
through an ASCII terminal, so it had to go through a lot of EBCDIC
paths, as well as get translated to and from ASCII.  At the time, I
checked several IBM ASCII/EBCDIC translation tables (all different,
of course) and picked a set that had common translation in all of them.
And the set works through IBM (RSCS) mail.

Your suggestion about avoiding a '.' in the first character is
interesting.  Why?  Are the any other things to avoid that people out
there knows messes up mailers?  I already know about an indented first
line getting chewed up sometimes.  One thing that I think we know is
that lines lines that start with 'M' (as in uuencode) work through
Unix mail.  However, one character per line seems like too much
overhead to me.  It would not be hard to detect troublesome lines
generated by the base 85 encoding and stick a character in front
that will be ignored later.

Mark Adler
madler@pooh.caltech.edu

mcastle@mcs213c.cs.umr.edu (Mike Castle {Nexus}) (03/27/91)

In article <12489@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
>
>They avoid the ASCII characters above 'z' (codes 123-127), plus they leave
>five lowercase letters unused.  (Actually they use 'x' and 'z' for special
>However, it'd be nice to hear from someone who uses non-ASCII mailers about
>just which characters are safe.  (Anybody have an EBCDIC character chart
>handy?)  One thing that might be good to avoid is putting '.' at the start
>of a line, which btoa will do quite happily.
>
>-- 
>			tom lane
>Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma


You need an EBCDIC chart?  Got one right here.  :->  

In doing a quick visual comparison of ASCII and EBCDIC (ASCII chart from 
Turbo Pascal 5.0 manual, EBCDIC from Assembler Language Programming: The 
IBM System/370 Family by George Struble), and limiting myself to the ASCII
chars from 32 (space) to 126 (tilde) I notice the following differences:

Ascii 91 ([ left bracket) dne in EBCDIC
Ascii 93 (] right bracket) dne in EBCDIC
Ascii 94 (^ caret) dne in EBCDIC           
Ascii 96 (` back quote) dne in EBCDIC

126-32+1=95 characters   96 if you count then cent sign (which procomm refuses
to let me enter with alt-155)

Just to add confusion to all this :->, brackets [] are displayable on 3270
terminals, they're just not what the ascii bracket get mapped to, and don't 
really exist in EBCDIC (actually, lot's of neat characters you can get 3270
terminals to display that don't exist EBCDIC).

If anyone wants to test mailers, mail me your sample sets at
s087891@umrvma.umr.edu
-- 
Mike Castle (Nexus) S087891@UMRVMA.UMR.EDU (preferred)       | XEDIT: Emacs
                mcastle@mcs213k.cs.umr.edu (unix mail-YEACH!)| on a REAL
Life is like a clock:  You can work constantly, and be right | operating
all the time, or not work at all, and be right twice a day.  | system. :->

tgl@g.gp.cs.cmu.edu (Tom Lane) (03/27/91)

I said:
> One thing that might be good to avoid is putting '.' at the start
> of a line, which btoa will do quite happily.
and Mark Adler said:
> Why?

Sorry to be obscure.  NNTP, the standard protocol for shipping netnews
messages across Internet, uses '.' alone on a line as end-of-message
marker.  To ensure that a data line like this can't be mistaken as
end-of-message, any '.' appearing at the start of a data line is doubled
(sent as '..') and then undoubled at the receiver.  Thus, '.' at the
start of a line is sent correctly, but you pay extra to send it across
the network link.

I believe NNTP is derived from the network mail transfer protocol, so
the same effect probably holds for ordinary mail messages.

Thus, it would be marginally more efficient if we could ensure '.'
didn't appear at the start of a line.  The only simple way I can see to
do this is not to use '.' at all in the encoding character set.  If we
can get 85 safe characters without '.', then it'd be a win.

> Are the any other things to avoid that people out
> there knows messes up mailers?  I already know about an indented first
> line getting chewed up sometimes.

I think it's pretty clear that we don't want to use space as one of the
encoding characters: it's likely to get lost if it is the first or last
character in a line, some overenthusiastic programs may translate runs
of spaces into tabs, etc etc.

The '.' business and EBCDIC compatibility are the only other constraints
I know about.

Mark, would you mind posting the character set you use for 'ship'?
I wasn't paying attention at the time...

-- 
			tom lane
Internet: tgl@cs.cmu.edu	BITNET: tgl%cs.cmu.edu@cmuccvma

peltz@cerl.uiuc.edu (Steve Peltz) (03/29/91)

In article <12496@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes:
>The '.' business and EBCDIC compatibility are the only other constraints
>I know about.

Although, as you point out, a '.' at the beginning isn't forbidden, it might
be best to avoid it simply to keep broken implementations from messing it
up.

Another possible constraint would be with a line that starts with "From". This
often gets translated to ">From" (and even variants in case sometimes get
translated). That's being overzealous in most cases (my mailer only treats
a line that begins with "From ", including the trailing space, as being a
message separator, and plenty of systems don't even care about that.
--
Steve Peltz
Internet: peltz@cerl.uiuc.edu	PLATO/NovaNET: peltz/s/cerl