[net.micro.atari16] New way to post binaries - disc

barada@maggot.applicon.UUCP (10/24/86)

>
>As a followup to my previous article, I have come across a new encoding
>technique (courtesy of the Math Faculty Computing Facility here at
>the university) which has the following properties:
>
>	- eight bit input data
>	- very restricted character set
>	- generally compresses files rather than expanding them
>	  (including binaries)
>	- needs no bit-level twiddling
>	- involves only table lookup
>
>The encoding system is easy to implement.  I am whipping one up now.
>
>This should make the transmittable archive format very easy to implement.
>My co-developer, Mike Gore, is planning to do a Basic (gack spew) version
>after I write one in C.
>
>Still:  comments?
>
>Tracy Tims
>mail to ihnp4!watmath!unit36!tracy

Tracy, I followed up on the idea from your last posting of doing strange
things with restricted character sets, and I think I have found a nice way
to do the encoding/decoding without too much overhead. I'll send you the
sources if you want.

Encoding scheme:

	Assumptions:

		The character set contains the digits 0-9,
		and at LEAST the letters A-P.

	The letters A-P are the hex values 0-15. Two hex "digits" always
appear together, so AA is 0 and PP is 255 (standard hex, with different
characters). For example, 'T' is hex 54, so it encodes as "FE" (F for the
high nibble 5, E for the low nibble 4). The digits 0-9 are used as a decimal
repeat count. The count is 1 less than the number of repetitions, so a
repeat count of 1 is interpreted as "two of the following" (and no repeat
count implies that there is only one of the following).
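
In outline, the core encoding (no caching) needs nothing but table lookup;
here is a minimal C sketch (the helper name put_run is mine), assuming the
repeat count precedes the hex pair, may run to several digits, and applies
to the decoded byte -- the description above leaves those details open:

/* Core encoder sketch (no caching): each byte becomes two characters
 * in A-P; a run of the same byte gets a decimal count that is one
 * less than the run length.  Line wrapping is omitted for brevity. */
#include <stdio.h>

static void
put_run(int byte, long run)
{
	if (run > 1)
		printf("%ld", run - 1);	/* repeat count, 1 less */
	putchar('A' + (byte >> 4));	/* high nibble */
	putchar('A' + (byte & 0x0F));	/* low nibble */
}

int
main()
{
	int c, last = EOF;
	long run = 0;

	while ((c = getchar()) != EOF) {
		if (c == last)
			run++;
		else {
			if (run > 0)
				put_run(last, run);
			last = c;
			run = 1;
		}
	}
	if (run > 0)
		put_run(last, run);
	putchar('\n');
	return 0;
}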

	Caching is implemented by using other characters (anything except
whitespace) as "cache" values for the most frequently used byte values.
This requires a two-pass encoder, but only a one-pass decoder.

The encoded file has a series of lines at the head, separated from the body
of the encoded text by a blank line. This is the only whitespace dependency.
Each line in the head contains three characters (plus the newline):
a character for the cached representation, and the two-hex-digit code that
the cache character represents. Since the caches can be anything besides
0-9, A-P and whitespace, that leaves a lot of choices. Plus, the cache set
can even be chosen dynamically per file. No caching reduces this to a plain
hex encoder (with its two-for-one expansion), but with caching characters
enabled, the encoded output is smaller.
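
A one-pass decoder that handles the cache header is about as simple; again
a C sketch under the same repeat-count assumptions, with no error checking:

/* Decoder sketch: each header line is "<cache char><hi><lo>"; the
 * blank line starts the body.  Assumes well-formed input. */
#include <stdio.h>
#include <string.h>

int
main()
{
	int cache[256];		/* cache character -> byte value */
	char line[16];
	int c, hi, repeat = 0;

	memset(cache, -1, sizeof cache);	/* -1 = not a cache char */

	/* Read the cache header, stopping at the blank line. */
	while (fgets(line, sizeof line, stdin) && line[0] != '\n')
		cache[(unsigned char)line[0]] =
		    ((line[1] - 'A') << 4) | (line[2] - 'A');

	/* One pass over the body. */
	while ((c = getchar()) != EOF) {
		if (c == ' ' || c == '\t' || c == '\n')
			continue;		/* whitespace independent */
		if (c >= '0' && c <= '9') {
			/* decimal repeat count, 1 less than the copies */
			repeat = repeat * 10 + (c - '0');
			continue;
		}
		if (c >= 'A' && c <= 'P') {	/* plain hex pair */
			hi = c - 'A';
			c = (hi << 4) | (getchar() - 'A');
		} else
			c = cache[c];		/* cached byte value */
		do
			putchar(c);
		while (repeat-- > 0);
		repeat = 0;
	}
	return 0;
}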

An example of encoding:

The line:

This is a test of the emergency broadcasting system <52 characters, counting the newline>

Becomes:

<blank line>
FEGIGJHDCAGJHDCAGBCAHEGFHDHECAGPGGCAHEGIGF
CAGFGNGFHCGHGFGOGDHJCAGCHCGPGBGEGDGBHDHEGJ
GOGHCAHDHJHDHEGFGNAK				    <108 characters>

Which is like BinHex (in a way). But if I use the letters
Q-Z as caching values, the file changes to:

QCA 			<Q is caching the hex value CA>
UGB			<U is caching the hex value GB>
WGD			< Etc... >
RGF
XGH
YGI
VGJ
ZGN
SHD
THE

FEYVSQVSQUQTRSTQGPGGQTYRQRZRHCXRGOWHJQGCHC
GPUGEWUSTVGOXQSHJSTRZAK

Which is the same size (108 characters), but since the input is short, the
caching never gets a chance to be effective: the body shrinks from 104
characters to 65, and the ten-line cache header (40 characters) eats the
savings right back.
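
The encoder's first pass just tallies byte frequencies and hands the most
common values to the available cache characters before the second pass
emits the body. A sketch of that first pass (the actual assignment order
and header layout could differ):

/* Encoder first-pass sketch: tally byte frequencies, assign the
 * most frequent byte values to the cache characters (Q-Z here,
 * as in the example above), and emit the header. */
#include <stdio.h>

static const char cachechars[] = "QRSTUVWXYZ";

int
main()
{
	long freq[256];
	int byte[256], i, j, t, c, n;

	for (i = 0; i < 256; i++) {
		freq[i] = 0;
		byte[i] = i;
	}
	while ((c = getchar()) != EOF)
		freq[c]++;

	/* Partial selection sort is fine for 256 entries: pull the
	 * most frequent byte values to the front. */
	n = sizeof cachechars - 1;
	for (i = 0; i < n; i++)
		for (j = i + 1; j < 256; j++)
			if (freq[byte[j]] > freq[byte[i]]) {
				t = byte[i]; byte[i] = byte[j]; byte[j] = t;
			}

	/* Emit the cache header, then the blank separator line. */
	for (i = 0; i < n; i++)
		printf("%c%c%c\n", cachechars[i],
		    'A' + (byte[i] >> 4), 'A' + (byte[i] & 0x0F));
	putchar('\n');
	/* (Second pass: rewind the input and emit the body,
	 *  substituting cache characters where they apply.) */
	return 0;
}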

I'll compare my encoding scheme to uuencode and report on the results.
On to gigantic programs; since it's for binaries, I used my Sun 3.0
kernel as the acid test:

-rwxr-xr-x  1 root       493853 Apr  8  1986 /pub/vmunix	<plain>
-rw-r--r--  1 barada     680443 Oct 24 14:05 vmunix.uuencoded	<uuencode>
-rw-r--r--  1 barada     670910 Oct 24 14:07 vmunix.encoded	<encoded>
	Note: the cache values used were Q-Z and a-z (36 cache values)
-rw-r--r--  1 barada     600208 Oct 24 14:31 vmunix.encoded	<encoded>
	Note: the cache values used were Q-Z, a-z and:
		!@#$%^&*()-_+=|{[}]:;<>,./	     (62 cache values)

As can be seen, my encoding scheme can produce a smaller output than
uuencode, but I don't have any checksums in my scheme.

Compressing the above encodings produces the following results:

-rw-r--r--  1 barada     387571 Oct 24 14:05 vmunix.uuencoded.Z
-rw-r--r--  1 barada     346875 Oct 24 14:07 vmunix.encoded.Z

So my encoding scheme expands a binary file by 35.85% (670910/493853)
compared to uuencode's 37.78% (680443/493853), which isn't much different.

What is really useful is the ability to limit the caching values to only
those that can pass through the networks, and to be almost completely
whitespace independent. It doesn't perform badly, either.

Any comments or ideas??? 


--
Peter Barada				       | (617)-671-9905
Applicon, Inc. A division of Schlumberger Ltd. | Billerica MA, 01821

UUCP: {allegra|decvax|mit-eddie|utzoo}!linus!raybed2!applicon!barada
      {amd|bbncca|cbosgd|wjh12|ihnp4|yale}!ima!applicon!barada

	Sanity is only a state of mind.

jhs@MITRE-BEDFORD.ARPA (10/28/86)

One disadvantage of the method mentioned in the above-referenced message is
that each encoded byte carries only 4 bits of information.

By comparison, uuencoded data transmits 6 bits per encoded byte, obviously
50% more for your money.

The problem with uuencode is that it uses certain troublesome characters,
such as spaces, to encode data.  This problem would be immediately
eliminated if we were to revise uuencode to use only non-troublesome
characters.  For example, the set a-z, A-Z, 0-9, '.', and '/', or any other
set of 64 noncontroversial characters, would do the trick.  However, it
would make programming simpler if the selected characters fell in two
contiguous ranges, such as a-z and the ASCII run from '5' through 'Z'
(that is, 56789:;<=>?@ plus A-Z).  (I think that adds up to 64 of 'em!)
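
The inner transform could stay exactly as in uuencode (three bytes in,
four six-bit values out), with only the value-to-character table changed.
A minimal sketch of the idea, ignoring uuencode's per-line length counts
and framing (hypothetical code, not a patch to the real uuencode):

/* uuencode-style 3-bytes-to-4-characters packing, but mapping the
 * 6-bit values onto a-z plus the contiguous ASCII run '5'..'Z'. */
#include <stdio.h>

static char
enc(int v)			/* 6-bit value -> safe character */
{
	v &= 0x3F;
	return v < 26 ? 'a' + v : '5' + (v - 26);
}

int
main()
{
	int b[3], i, n;

	for (;;) {
		for (n = 0; n < 3; n++)
			if ((b[n] = getchar()) == EOF)
				break;
		if (n == 0)
			break;
		for (i = n; i < 3; i++)
			b[i] = 0;		/* pad the last group */
		putchar(enc(b[0] >> 2));
		putchar(enc((b[0] << 4) | (b[1] >> 4)));
		putchar(enc((b[1] << 2) | (b[2] >> 6)));
		putchar(enc(b[2]));
		if (n < 3)
			break;
	}
	putchar('\n');
	return 0;
}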

John Sangster
jhs@mitre-bedford.arpa