[comp.compression] Trying to get maximum compression

force@aix01.aix.rpi.edu (Christopher Jon Pe) (03/25/91)

shaw@pegasus.com (Sandy Shaw) writes:

>I am trying to get the maximum compression possible of the following
>32 byte hex file(in ASC): 

>f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
>0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec

>The best results I get are using the compact utility(adaptive Huffman). This
>gives 21.88% compression or 7 bytes saved. Does anyone out there have a 
>scheme that can improve on this? 
  Well, if it is a HEX file, why don't you try to compress it by taking two
bytes and putting them into one byte (range 0-255)?

-- 
Gently, he stood there, above it all.  Only when he turned his steely-blue
eyes upwards to the stars did he see the subtle purity that lay within the
cosmos.  Finally, he was at peace, for he realized that, although he was
parted from his love by 300 miles, they both saw the same stars.

callahan@cs.jhu.edu (Paul Callahan) (03/25/91)

In article <1991Mar24.152106.6333@pegasus.com> shaw@pegasus.com (Sandy Shaw) writes:
>I am trying to get the maximum compression possible of the following
>32 byte hex file(in ASC): 
>
>f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
>0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec

I suspect that this isn't what you're looking for, but I figure
SOMEBODY's got to say it.

How about the following compression scheme: 

The bit string "0" represents the 32 byte sequence:

    f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
    0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec

Any bit string beginning with a "1" represents the sequence of bits
following the 1.

Note:  This is a compression scheme to the extent that it allows all bit
strings to be represented.  It also gives maximum compression for the given
hex file, unless one is allowed to encode it as the empty string.

It may seem that I'm just making a wisecrack, but I think there is a valid
point here.  That is, it's not immediately clear how one may ask this sort
of compression question about a single string (as opposed to infinite families
of strings) without leading to the trivial solution I have outlined above.

Perhaps you are looking for the notion of Kolmogorov complexity, roughly the 
length of the shortest program that can produce a given string.  For a single
string, even this question can only be posed in an interesting manner for
a given computer or Turing machine (i.e. what's to stop me from assuming my
programming language has a primitive "!" that prints the above string?).

--
Paul Callahan
callahan@cs.jhu.edu

brad@looking.on.ca (Brad Templeton) (03/25/91)

I don't understand the problem of trying to get maximum compression of
a specific 32 byte file.   The problem of compressing a specific file of
32 bytes makes no sense.  You either want to compress a general class of
32 byte files, because you have lots of them (in which case the answer may
be to compress them in groups) or a very large specific file.

Since any compression code will be more than 32 bytes long, the far simpler
answer is just to store the file rather than the decompressor.

Compression involves removing redundant information from a file.  A 32 byte
file by definition contains no more than 32 bytes of redundant information!

If you want to phrase your question, "how do I compress 32 byte files that
have the following properties?" then people can answer it.  Of course the
properties are needed, since if the file is random bits, it can't be compressed
in the general case.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

vd09+@andrew.cmu.edu (Vincent M. Del Vecchio) (03/25/91)

Is there any point to this exercise?  I suggest the following compression
algorithm:
If input is f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec 0cbb 8bec 5cdb ecdb 5c9c
bbec 8bdb 9cec, output a "1" bit.  Otherwise, output a "0" bit and append the
results of compression with your favorite compression algorithm.  This will, of
course, result in a theoretical compression ratio of 256:1, but there are a
bunch of problems, too...

-Vincent Del Vecchio
vd09@andrew.cmu.edu

raymond@cosc.canterbury.ac.nz (cantva) (03/25/91)

From article <obvFxEe00UgK01dIQG@andrew.cmu.edu>, by vd09+@andrew.cmu.edu (Vincent M. Del Vecchio):
> If input is f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec 0cbb 8bec 5cdb ecdb 5c9c
> bbec 8bdb 9cec, output a "1" bit.  Otherwise, output a "0" bit and append the
> results of compression with your favorite compression algorithm.  This will, of
> course, result in a theoretical compression ratio of 256:1, but there are a
> bunch of problems, too...

Like having to store the string (in some place or other) anyway. This is fine
if you are just transmitting the string somewhere and you wish to minimise
what you are sending but if you want to archive it this wont work.









--
Raymond Wilson.	email:	raymond@cosc.canterbury.ac.nz
		snail:	c/- Computer Science Department,
			University of Canterbury,
			New Zealand.

dik@cwi.nl (Dik T. Winter) (03/25/91)

 > shaw@pegasus.com (Sandy Shaw) writes:
 > 
 > >I am trying to get the maximum compression possible of the following
 > >32 byte hex file(in ASC): 
 > 
 > >f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
 > >0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec
 > 
 > >The best results I get are using the compact utility(adaptive Huffman). This
 > >gives 21.88% compression or 7 bytes saved. Does anyone out there have a 
 > >scheme that can improve on this? 
 >   Well, if it is a HEX file, why don't you try to compress it by taking two
 > bytes and putting them into one byte (range 0-255)?
But even if those are just 32 bytes, you can compress it to zero bytes with a
general compression/decompression program (specialized for this case of course).
(E.g. if the compressed file is empty the original were those 32 bytes; if
the (only) byte is 0, the original was empty, otherwise it is output from
compress.  This expands slightly on empty files of course.)
But I do not think that was the intention.  So what is the real problem?
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

dvldbg@cs.umu.se (Daniel Brahneborg) (03/25/91)

In article <a+5fs#b@rpi.edu> force@aix01.aix.rpi.edu (Christopher Jon Pe) writes:
>shaw@pegasus.com (Sandy Shaw) writes:
>
>>I am trying to get the maximum compression possible of the following
>>32 byte hex file(in ASC): 
>
>>f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
>>0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec
>
>  Well, if it is a HEX file, why don't you try to compress it by taking two
>bytes and putting them into one byte (range 0-255)?
>

Get serious! He said it was a 32-byte HEX file, not a 80-byte ASCII file.
Where were you during the lessons about bits and bytes?

/Basic

anthony@convex.csd.uwm.edu (Anthony J Stieber) (03/25/91)

In article <1991Mar24.152106.6333@pegasus.com> shaw@pegasus.com (Sandy Shaw) writes:
>I am trying to get the maximum compression possible of the following
>32 byte hex file(in ASC): 
>
>f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
>0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec
>

First do some simple analysis of the data.  There are 32 bytes, of
these there are 11 unique bytes.  Replace each data byte with a 4 bit
pointer to one of the 11 bytes.  Now instead of 256 (32*8) bits
uncompressed data, or 200 (25) Hufmann compressed bits, there are 216
(11*8+32*4) bits, which isn't as good.  Now take the table and compute
the differences between the 11 bytes, starting with zero.  There are 11
differences, each difference is less than 64 so they can be stored as 6
bit values. Thus the table can be compressed from 88 (11*8) bits down
to 66 (11*6).  Total bits is 194 (128+66).

This compression is of course totally dependant on the distribution of
the data, and is probably worthless in real life situations.
Interestingly enough there are 11 unique bytes and 11 unique nybbles.
As it happens even if there were only 4 unique nybbles that would still
take 128 (64*2) bits to encode.  Thus a nybble table would not
generally be more efficient.

However, what's the point?  The code to do any sort of uncompression like
this is going to be much bigger than the data.  What are you actually
trying to do?
--
<-:(= Anthony Stieber	anthony@csd4.csd.uwm.edu   uwm!uwmcsd4!anthony

warwick@cs.uq.oz.au (Warwick Allison) (03/25/91)

In <1991Mar25.002730.11027@cs.umu.se> dvldbg@cs.umu.se (Daniel Brahneborg) writes:

>>  Well, if it is a HEX file, why don't you try to compress it by taking two
>>bytes and putting them into one byte (range 0-255)?
>>
>Get serious! He said it was a 32-byte HEX file, not a 80-byte ASCII file.
>Where were you during the lessons about bits and bytes?

click.

click.

click.  WOOOOOOOOMPH.
Flame on.

Gee, this group is off to a good start.

What make the term "hex file" mean "binary file" ?  A "hex file" isn't anything
`defined in the literature'.  I think the respondant was more likely to be right
in interpretting the stupid question in a way which is at least slightly worth
answering.  Personally, I just saw the question and brushed it off.  Whoever
posted the question did not understand what they were asking, so it's no use
responding via news.  e-mail them and explain that the question is a pile of
poop.  As it turned out, we have prompted quite some discussion, and I guess
we have established ground level information.

Anyway, I'm off to ask a real question...

FFFFFFFFFFfffffffffizzzzzzzssss s   ss s... POP!
Flame off.

Warwick.

--
  _--_|\   	warwick@cs.uq.oz.au
 /      *  <--	Computer Science Department,
 \_.--._/ 	University of Queensland,
       v    	AUSTRALIA.

gwyn@smoke.brl.mil (Doug Gwyn) (03/25/91)

In article <1991Mar24.152106.6333@pegasus.com> shaw@pegasus.com (Sandy Shaw) writes:
>I am trying to get the maximum compression possible of the following
>32 byte hex file(in ASC): 

Something I don't understand here -- surely it takes ZERO bits to represent
a sequence that you already know in advance?

philip@beeblebrox.dle.dg.com (Philip Gladstone) (03/26/91)

>>>>> On 24 Mar 91 15:21:06 GMT, shaw@pegasus.com (Sandy Shaw) said:

Sandy> I am trying to get the maximum compression possible of the following
Sandy> 32 byte hex file(in ASC): 

Sandy> f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
Sandy> 0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec

Sandy> The best results I get are using the compact utility(adaptive Huffman). This
Sandy> gives 21.88% compression or 7 bytes saved. Does anyone out there have a 
Sandy> scheme that can improve on this? 

Yes:

        I can compress this file down to 1 byte (one bit really) with the following
algorithm:

        If file data is equal to "f3e9 .... 9cec" then output 0
        else output 1 followed by file data.

The expansion function is obvious.

Another algorithm that I could use for the above data is:

        if next byte is 'ec' then output 1 else output (0 followed by byte).
        
This results in 32 control bits + 20 bytes of data. This is 24 bytes
-- a saving of 8 bytes. 

I suspect that this is not the sort of answer that you wanted. You
need to be more precise about the other data you want to compress. For
example, compression of the following data:

1 78 123 511 985 ...
or
3 45 67 999 ...

can be done in a number of ways. The information that the format of
the data is a sequence of non-decreasing integers seperated by
whitespace allows you to use a better compression algorithm. In many
applications for compression, a great deal is known about the data
before you start. This information should be used in designing the
algorithm. The various programs like compress, pack, arc, etc all make
assumptions about the data they are working on. When the assumptions
are incorrect, they act as expansion programs!

Philip
--
Philip Gladstone         Dev Lab Europe, Data General, Cambridge, UK

    Listen three eyes, don't you try and outweird me, I get
    stranger things than you free with my breakfast cereal.

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (03/27/91)

I got the same 25 byte size with a compressor I'm playing with, which
gives very good results on small files. What params did you feed compact
to get the results? I tried to duplicate it and couldn't.

Or have you haved compact?
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/27/91)

In article <PHILIP.91Mar25135706@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
>>>>>> On 24 Mar 91 15:21:06 GMT, shaw@pegasus.com (Sandy Shaw) said:
>Sandy> I am trying to get the maximum compression possible of the following
>Sandy> 32 byte hex file(in ASC): 
>Sandy> f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
>Sandy> 0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec
>Sandy> The best results I get are using the compact utility(adaptive Huffman). This
>Sandy> gives 21.88% compression or 7 bytes saved. Does anyone out there have a 
>Sandy> scheme that can improve on this? 
>Yes:
>I can compress this file down to 1 byte (one bit really) with the following
>algorithm ...

While I must have reservations similar to Philip regarding coming up
with an encoding of a _single example_ rather than a _class_ of strings, it 
_is_ a fun intellectual exercise. To get the ball rolling (a bit
faster), I've managed a compression of 30% using the appended program 
(provided you don't count the program as part of the encoding! :-).
That is, almost 10 bytes are saved (9.4 to be exact).

Tony Warnock at lanl has pointed out to me there are algorithms that turn 
bitstrings into shift-register machines. Essentially such machines require a 
very small number of parameters and might prove useful.

-kym

BTW, LZ can't compress the output file further, although this would seem
simple to do.

====================program====================
#include <assert.h>
#include <stdio.h>

int bytes[32];

int nbr;
int nbw;
int pos;
int lastpos;

outnum(x) {
	putchar('0'+(x&1)), nbw++; x>>=1; 
	putchar('0'+(x&1)), nbw++; x>>=1;
	if(x) putchar('0'+(x&1)), nbw++;
	}

outbit(b){
	if(b&1) outnum(pos-lastpos),lastpos=pos+1;
	++pos;
	}

main(){
	int i,j;
	nbr=0;
	nbw=0;
	pos=0;
	lastpos=0;
	for(i=0;i<32;i++)scanf("%2x",&bytes[i]),nbr+=8;
	for(i=0;i<32;i++)printf("%x ",bytes[i]);
	putchar('\n');
	for(i=0;i<32;i++){
		unsigned diff = bytes[i]^0xec;
		outbit(diff); diff>>=1;
		outbit(diff); diff>>=1;
		outbit(diff); diff>>=1;
		outbit(diff); diff>>=1;
		outbit(diff); diff>>=1;
		outbit(diff); diff>>=1;
		outbit(diff); diff>>=1;
		outbit(diff);
		}
	putchar('\n');
	printf("%d bits read; %d bits written; compression=%g\n",
		nbr,nbw,(double)nbw/nbr
		);
	exit(0);
	}
====================log file====================
f3 e9 ec 5c 8b ec ec db ec e9 ec 12 ec 3f ec ec c bb 8b ec 5c db ec db 5c 9c bb ec 8b db 9c ec 
000000000011101000010000000010010000001000010100110000000000000000001100010100000000001010100000010010100100000001000010000010000110010001000010000010101000000010010000010000110000
256 bits read; 180 bits written; compression=0.703125
====================end end end====================

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/27/91)

In article <1991Mar27.132216.8132@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>I've managed a compression of 30% using the appended program 
>(provided you don't count the program as part of the encoding! :-).
>That is, almost 10 bytes are saved (9.4 to be exact).

BTW, my little formula indicates a compression of 56% (i.e. save almost 18 
bytes) may be possible since p=.33 approximately.

-kym

doctor@vlsi.polymtl.ca (doctor zero) (03/29/91)

In article <1991Mar24.152106.6333@pegasus.com> shaw@pegasus.com (Sandy Shaw) writes:
>I am trying to get the maximum compression possible of the following
>32 byte hex file(in ASC): 
>
>f3e9 ec5c 8bec ecdb ece9 ec12 ec3f ecec
>0cbb 8bec 5cdb ecdb 5c9c bbec 8bdb 9cec
>
>The best results I get are using the compact utility(adaptive Huffman). This
>gives 21.88% compression or 7 bytes saved. Does anyone out there have a 
>scheme that can improve on this? 
>
>Sandy Shaw
>shaw@pegasus.com


	one way that *could* increase your compression is making your
data 'more compressible'.  Try xoring your bytes in a daisy chain
way (first byte xor second byte -> first output byte) and then 
compress THAT output.

	this kind of pre-processing is at its best when you are trying
to compress a checkerboard image.  By XORing successive lines, you
ultimately get a white  picture plus something at the last line.

I think so.. correct me if I'm wrong.

			paul
--
|						| Everybody, everybody
| doctor@info.polymtl.ca			| I don't know anobody else
| "De toute facon, on est les meilleurs!"	| Ride on time
|						| Strike it up - Black Box !