[net.dcom] Squeezing files.

ken@turtlevax.UUCP (Ken Turkowski) (06/19/85)

In article <1861@ukma.UUCP> sean@ukma.UUCP (Sean Casey) writes:
>In article <784@turtlevax.UUCP> ken@turtlevax.UUCP (Ken Turkowski) writes:
>>I think you should consider changing to Lempel-Ziv Compression (posted
>>to the net as "compress", version 3.0), which normally gives 70%
>>compression (30% of original size) to text.  The program is fast, and
>>adapts to whatever type of data you give it, unlike static Huffman
>>coding.  It usually produces 90% (!) compression on binary images.
>
>WHOA BUDDY!
>
>Lempel-Ziv doesn't do NEARLY that well.  We've been using it for
>months, and we've found that text and program sources usually get about
>55-65% compression, while binaries get about 45-55% compression.  This
>is encountered in the optimal case of compressing a large archive of
>files.  As files get smaller, especially as they drop below about 8k in
>size, compression worsens. I seriously doubt that most binaries contain
>only 10% of unambiguous information, much less being compressible to
>that size.

I can see that we have a semantic problem here.  By "image", I mean a
picture, or two-dimensional signal.  By "binary", I mean ones and
zeros, black and white, no grey-scale, no color.  A binary image is
then a coarsely quantized picture, with lots of runs of zeros and
ones.  L-Z does exceptionally well on this type of data, and I will
reiterate my claim of 90% average compression.
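
The percentages in this thread follow the convention stated in the quoted
text: "70% compression" means the output is 30% of the original size.  A
minimal sketch of why a run-heavy bi-level picture suits a dictionary coder
follows.  It is a bare textbook LZW in Python, not the code of "compress"
3.0: there is no code-table reset, and the code widths (9 bits and growing)
are a simplifying assumption.  The disk test picture and all function names
are made up for illustration.

```python
def lzw_codes(data):
    """Textbook LZW: return the list of dictionary codes for a bytes object."""
    table = {bytes([i]): i for i in range(256)}  # all single-byte strings
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            codes.append(table[current])     # emit longest known string
            table[candidate] = len(table)    # learn the new string
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def packed_size(codes):
    """Bytes needed to pack the codes (assumption: widths start at 9 bits).

    The k-th code emitted is always <= 255 + k, so that many bits suffice.
    """
    bits = sum(max(9, (255 + k).bit_length()) for k in range(len(codes)))
    return (bits + 7) // 8


def compression_percent(original, compressed):
    """Percentage of the original size that was removed."""
    return 100.0 * (original - compressed) / original


def disk_image(size=256, radius=90):
    """Synthetic bi-level picture of a filled disk, 8 pixels per byte."""
    centre = size // 2
    out = bytearray()
    for y in range(size):
        for x0 in range(0, size, 8):
            byte = 0
            for x in range(x0, x0 + 8):
                inside = (x - centre) ** 2 + (y - centre) ** 2 < radius ** 2
                byte = (byte << 1) | int(inside)
            out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    image = disk_image()
    packed = packed_size(lzw_codes(image))
    print(len(image), "->", packed, "bytes,",
          round(compression_percent(len(image), packed), 1), "% compression")
```

The figure it prints applies only to this synthetic disk; scanned pictures
with noise or dithering will behave differently.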

As for program source code and executable machine code, I get the
same types of compression ratios as you.

I'm curious; what is the etymology of the word "binary" as it is
sometimes used to refer to executable machine code?  And why does it
imply program rather than data?
-- 

Ken Turkowski @ CADLINC, Menlo Park, CA
UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.ARPA

zben@umd5.UUCP (06/21/85)

I added net.nlang to the group header, because we are getting into that area,
and because I thought the nlang people might be interested in this discussion.

In article <789@turtlevax.UUCP> ken@turtlevax.UUCP (Ken Turkowski) writes:
>In article <1861@ukma.UUCP> sean@ukma.UUCP (Sean Casey) writes:
>>In article <784@turtlevax.UUCP> ken@turtlevax.UUCP (Ken Turkowski) writes:
>>>I think you should consider changing to Lempel-Ziv Compression (posted
>>>to the net as "compress", version 3.0), which normally gives 70%
>>>compression (30% of original size) to text.  The program is fast, and
>>>adapts to whatever type of data you give it, unlike static Huffman
>>>coding.  It usually produces 90% (!) compression on binary images.
>>
>>Lempel-Ziv doesn't do NEARLY that well.  We've been using it for
>>months, and we've found that text and program sources usually get about
>>55-65% compression, while binaries get about 45-55% compression.  
>
>I can see that we have a semantic problem here.  By "image", I mean a
>picture, or two-dimensional signal.  By "binary", I mean ones and
>zeros, black and white, no grey-scale, no color.  
>
>I'm curious; what is the etymology of the word "binary" as it is
>sometimes used to refer to executable machine code?  And why does it
>imply program rather than data?

I remember way back when IBM was the only game in town, they called the 
output decks produced by compilers "relocatable binaries".  The Univac  
system I grew up on has both "relocatable elements" and "absolute elements",
the latter sort of like "load modules" on current IBM systems, programs
linked and ready-to-run, but incapable of further modification.

So, Univac dropped the "binary" part.  It seems another branch in the
etymology of these beasts dropped the "relocatable" and just ended up with
"binary"; on many systems there is no "absolute" form, so the distinction
was not needed.

Now, "image", to my mind, implies something else entirely.  It implies a
strict one-for-one correspondence between words in-core and words in the file.
By this definition, neither the Univac (very tightly-packed format) nor the
usual Unix (because of BSS) implementations apply.  I understand on the old
TOPS-10 system a running program could write a copy of itself out to the file
system, which could then later be executed and pick up where it had left off.
THIS qualifies as an "image".

Any takers?

-- 
Ben Cranston  ...{seismo!umcp-cs,ihnp4!rlgvax}!cvl!umd5!zben  zben@umd2.ARPA

jeff@rtech.UUCP (Jeff Lichtman) (06/28/85)

> >
> >I'm curious; what is the etymology of the word "binary" as it is
> >sometimes used to refer to executable machine code?  And why does it
> >imply program rather than data?
> 

Here's my guess.  There are many ways to represent numbers (as we all should
know).  Binary is a format that people have trouble reading, and strings of
the characters 0-9 are easy to read.  I believe that, by extension, the word
"binary" is applied to any non-human-readable data, especially when it is
stored in files.  The human-readable and non-human-readable forms of programs
(source and object or executable code) parallel the human-readable and
non-human-readable forms of numbers, so it's easy to draw an analogy.
-- 
Jeff Lichtman at rtech (Relational Technology, Inc.)
aka Swazoo Koolak

{amdahl, sun}!rtech!jeff
{ucbvax, decvax}!mtxinu!rtech!jeff

msp@ukc.UUCP (M.S.Parsons) (06/28/85)

In article <789@turtlevax.UUCP> ken@turtlevax.UUCP (Ken Turkowski) writes:
>>..
>>Lempel-Ziv doesn't do NEARLY that well.  We've been using it for
>>months, and we've found that text and program sources usually get about
>>55-65% compression, while binaries get about 45-55% compression.
>>..
>I can see that we have a semantic problem here.  By "image", I mean a
>picture, or two-dimensional signal.  By "binary", I mean ones and
>zeros, black and white, no grey-scale, no color.
>..
>I reiterate my claim of 90% average compression.
>..

I agree with Ken: compress works brilliantly with binary IMAGES. It is
certainly better than UCB compact. What's interesting is that it works 
well with the image as a raster, run-length or quadtree: the underlying 
structure seems to make little difference.
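
For readers unfamiliar with the terms: a raster stores every pixel of a row,
while a run-length form stores (value, count) pairs.  A minimal Python sketch
of the conversion, with hypothetical function names, assuming a row is simply
a list of 0/1 pixels:

```python
from itertools import groupby


def run_lengths(row):
    """Turn a raster row of 0/1 pixels into (value, count) runs."""
    return [(value, len(list(group))) for value, group in groupby(row)]


def expand_runs(runs):
    """Inverse of run_lengths: rebuild the raster row from its runs."""
    row = []
    for value, count in runs:
        row.extend([value] * count)
    return row
```

Either form can then be written out as bytes and handed to the compressor for
comparison.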
--Mike.

mac@uvacs.UUCP (Alex Colvin) (06/30/85)

 >
 >I'm curious; what is the etymology of the word "binary" as it is
 >sometimes used to refer to executable machine code?  And why does it
 >imply program rather than data?

Probably because everything was stored on Hollerith cards (1 character/column),
except for code, which was stored on binary decks (12 bits/column).