xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (03/26/91)
d88-jwa@byse.nada.kth.se (Jon W{tte) writes: > The previously suggested way of using arithmetic coding for > uuencode-style data encoding into printable characters seems cool, but > a simple "base 85" mapping from four bytes to five chars (such as used > by atob) and vice versa does the trick with _very_ close to optimal > performance. The problem is; it requires longword modulo and/or > division, which has to be simulated on weaker processors (such as > 68000 or (yukk !) 8086) Was that mine? It seems like months since I remember putting that claim up. It had a couple of advantages: o arithmetic encoding produces its output data bitwise anyway, so it is almost no extra trouble to capture it in groups of 13 bits instead of 8 or 16 bits. o 2^13 is just a smidgen less than 95*95, which makes close to optimal use of two bytes to encode 13 bits; let's see: 13 4 65 1 65 64 -- : - :: -- : - :: -- : -- 16 5 64 1 80 80 so the encoding of 13 bits into 16 is just a smidgen more efficient than the btoa 32 bits into 40 bits encoding, if I did that right, but is a bit more work _except_ in the case where arithmetic encoding or other bit-by-bit compression output is already being done anyway. I'd love to see someone (me if I were in my right mind) take lharc and replace the Huffman coding step by an arithmetic encoding step; if it were at least as efficient an overall compression method (which seems gut-level likely, since Huffman coding loses efficiency compared to arithmetic compression by having to output an even number of bits per compressed input symbol), the nice bit by bit output would quickly seduce one into forming a command line flag option to make the output mailable ASCII, instead of the default binary, which would then replace the current "compress then encode for mailing" paradigm by a single step, integrated solution. Kent, the man from xanth. <xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us> -- Does btoa really only use base _eighty_-five? Doesn't look like it could and still be that close to my a.e. solution in efficiency. A typo, perhaps?
d88-jwa@byse.nada.kth.se (Jon W{tte) (03/26/91)
In article <> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
Was that mine? It seems like months since I remember putting that
claim up.
Nope. It was recent. I talked to the author via mail, but wanted to
point this out to the group.
Does btoa really only use base _eighty_-five? Doesn't look like it could
and still be that close to my a.e. solution in efficiency. A typo,
perhaps?
No. 85^5 > 256^4 (try it yourself !) This results in an axpansion of
20%, just a little bit more than the theoretical base-9[56] coder.
That's logarithms at work for you ! :-)
The nice thing about base 85 is that you can choose the character set
to exclude certain ill-behaving characters (notably tilde (in cu) and
spce (as in the BSD tty driver :-))
h+@nada.kth.se
Jon W{tte
--
"The IM-IV file manager chapter documents zillions of calls, all of which
seem to do almost the same thing and none of which seem to do what I want
them to do." -- Juri Munkki in comp.sys.mac.programmer
tgl@g.gp.cs.cmu.edu (Tom Lane) (03/26/91)
There's a vital point that several posters in this thread seem to have missed. If you want a representation that is safe to mail around the world, *you can't use all ninety-five ASCII symbols*. Some mailers translate mail into other character sets (can you say EBCDIC?). Some of the less common ASCII symbols do not appear in EBCDIC, and unfortunately there isn't a standardized translation. The authors of btoa did the necessary research and concluded that there are eighty-five ASCII symbols that are safe to mail. So that's why btoa uses base 85. Any practical new scheme for mailable representations is not going to be much better. Incidentally, by my calculation, 85 symbols give you log2(85) = 6.41 bits per symbol, where 95 give you 6.57. So we are only talking a three percent difference in size anyway. (I have no idea whether btoa reaches the full 6.41 bits per symbol. More likely they translate four source bytes into five output symbols, or 6.4 bits/symbol.) -- tom lane Internet: tgl@cs.cmu.edu BITNET: tgl%cs.cmu.edu@cmuccvma
madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)
I have a program called "ship" (written in C) that works on a wide range of Unix machines, as well as MSDOS, that uses base 85 coding as well as automatically splits mail into pieces if requested, mails them if requested, reassembles them at the other end, and also does a four byte CRC check on the result. I can send it to whoever is interested. Mark Adler madler@pooh.caltech.edu
madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)
By the way, my ship program uses EBCDIC-safe characters and no spaces. Mark Adler madler@pooh.caltech.edu
tgl@g.gp.cs.cmu.edu (Tom Lane) (03/27/91)
In article <12485@pt.cs.cmu.edu>, I said: > The authors of btoa did the necessary research and concluded that there are > eighty-five ASCII symbols that are safe to mail. So that's why btoa uses > base 85. After looking at the recently posted source code, it's clear that btoa uses base 85 because that's what you need for 6.4 bits/symbol (four input bytes map to five output characters). 2^6.4 = 84.45, so base 84 is too small. They avoid the ASCII characters above 'z' (codes 123-127), plus they leave five lowercase letters unused. (Actually they use 'x' and 'z' for special purposes: 'x' is the EOF mark and 'z' represents four zero input bytes.) *If* the btoa authors are right about which ASCII characters are safe to mail (and I am no longer willing to take that as certain) then you could theoretically go up to base 90 in a mail-safe representation. But that gets you log2(90) = 6.49 bits/symbol which is hard to use efficiently, and it's only 1% less space than btoa's 6.4 bits/symbol. (In some cases, btoa's use of 'z' to represent four zeros would buy a lot more than 1%.) I believe btoa is pretty close to the optimal mailable representation of arbitrary binary data. You can't possibly do more than about 2.4% better (using 94 ASCII characters), and given practical considerations such as unsafe characters and conversion speed, btoa looks pretty good. However, it'd be nice to hear from someone who uses non-ASCII mailers about just which characters are safe. (Anybody have an EBCDIC character chart handy?) One thing that might be good to avoid is putting '.' at the start of a line, which btoa will do quite happily. -- tom lane Internet: tgl@cs.cmu.edu BITNET: tgl%cs.cmu.edu@cmuccvma
madler@nntp-server.caltech.edu (Mark Adler) (03/27/91)
In article <12489@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes: >However, it'd be nice to hear from someone who uses non-ASCII mailers about >just which characters are safe. (Anybody have an EBCDIC character chart >handy?) One thing that might be good to avoid is putting '.' at the start >of a line, which btoa will do quite happily. The set of 85 characters that I use in "ship" was originally selected for a full screen file transfer program on IBM 3270 terminals emulated through an ASCII terminal, so it had to go through a lot of EBCDIC paths, as well as get translated to and from ASCII. At the time, I checked several IBM ASCII/EBCDIC translation tables (all different, of course) and picked a set that had common translation in all of them. And the set works through IBM (RSCS) mail. Your suggestion about avoiding a '.' in the first character is interesting. Why? Are the any other things to avoid that people out there knows messes up mailers? I already know about an indented first line getting chewed up sometimes. One thing that I think we know is that lines lines that start with 'M' (as in uuencode) work through Unix mail. However, one character per line seems like too much overhead to me. It would not be hard to detect troublesome lines generated by the base 85 encoding and stick a character in front that will be ignored later. Mark Adler madler@pooh.caltech.edu
mcastle@mcs213c.cs.umr.edu (Mike Castle {Nexus}) (03/27/91)
In article <12489@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes: > >They avoid the ASCII characters above 'z' (codes 123-127), plus they leave >five lowercase letters unused. (Actually they use 'x' and 'z' for special >However, it'd be nice to hear from someone who uses non-ASCII mailers about >just which characters are safe. (Anybody have an EBCDIC character chart >handy?) One thing that might be good to avoid is putting '.' at the start >of a line, which btoa will do quite happily. > >-- > tom lane >Internet: tgl@cs.cmu.edu BITNET: tgl%cs.cmu.edu@cmuccvma You need an EBCDIC chart? Got one right here. :-> In doing a quick visual comparison of ASCII and EBCDIC (ASCII chart from Turbo Pascal 5.0 manual, EBCDIC from Assembler Language Programming: The IBM System/370 Family by George Struble), and limiting myself to the ASCII chars from 32 (space) to 126 (tilde) I notice the following differences: Ascii 91 ([ left bracket) dne in EBCDIC Ascii 93 (] right bracket) dne in EBCDIC Ascii 94 (^ caret) dne in EBCDIC Ascii 96 (` back quote) dne in EBCDIC 126-32+1=95 characters 96 if you count then cent sign (which procomm refuses to let me enter with alt-155) Just to add confusion to all this :->, brackets [] are displayable on 3270 terminals, they're just not what the ascii bracket get mapped to, and don't really exist in EBCDIC (actually, lot's of neat characters you can get 3270 terminals to display that don't exist EBCDIC). If anyone wants to test mailers, mail me your sample sets at s087891@umrvma.umr.edu -- Mike Castle (Nexus) S087891@UMRVMA.UMR.EDU (preferred) | XEDIT: Emacs mcastle@mcs213k.cs.umr.edu (unix mail-YEACH!)| on a REAL Life is like a clock: You can work constantly, and be right | operating all the time, or not work at all, and be right twice a day. | system. :->
tgl@g.gp.cs.cmu.edu (Tom Lane) (03/27/91)
I said: > One thing that might be good to avoid is putting '.' at the start > of a line, which btoa will do quite happily. and Mark Adler said: > Why? Sorry to be obscure. NNTP, the standard protocol for shipping netnews messages across Internet, uses '.' alone on a line as end-of-message marker. To ensure that a data line like this can't be mistaken as end-of-message, any '.' appearing at the start of a data line is doubled (sent as '..') and then undoubled at the receiver. Thus, '.' at the start of a line is sent correctly, but you pay extra to send it across the network link. I believe NNTP is derived from the network mail transfer protocol, so the same effect probably holds for ordinary mail messages. Thus, it would be marginally more efficient if we could ensure '.' didn't appear at the start of a line. The only simple way I can see to do this is not to use '.' at all in the encoding character set. If we can get 85 safe characters without '.', then it'd be a win. > Are the any other things to avoid that people out > there knows messes up mailers? I already know about an indented first > line getting chewed up sometimes. I think it's pretty clear that we don't want to use space as one of the encoding characters: it's likely to get lost if it is the first or last character in a line, some overenthusiastic programs may translate runs of spaces into tabs, etc etc. The '.' business and EBCDIC compatibility are the only other constraints I know about. Mark, would you mind posting the character set you use for 'ship'? I wasn't paying attention at the time... -- tom lane Internet: tgl@cs.cmu.edu BITNET: tgl%cs.cmu.edu@cmuccvma
peltz@cerl.uiuc.edu (Steve Peltz) (03/29/91)
In article <12496@pt.cs.cmu.edu> tgl@g.gp.cs.cmu.edu (Tom Lane) writes: >The '.' business and EBCDIC compatibility are the only other constraints >I know about. Although, as you point out, a '.' at the beginning isn't forbidden, it might be best to avoid it simply to keep broken implementations from messing it up. Another possible constraint would be with a line that starts with "From". This often gets translated to ">From" (and even variants in case sometimes get translated). That's being overzealous in most cases (my mailer only treats a line that begins with "From ", including the trailing space, as being a message separator, and plenty of systems don't even care about that. -- Steve Peltz Internet: peltz@cerl.uiuc.edu PLATO/NovaNET: peltz/s/cerl