[comp.sources.wanted] Need information on data compression algorithms

jsm@vax1.UUCP (03/25/87)

Needed: information on data compression techniques and algorithms, such as
Huffman coding and Lempel-Ziv, that are used in programs such as SQ and
ARC. I need something that is fast enough to be transparent to the user in
an interactive system.

I don't know anything about the topic, and would appreciate suggestions of
good, comprehensive textbooks containing algorithms.

A few weeks ago a version of ARC was posted to either net.sources or
mod.sources (source code for UNIX*, I believe). If anyone saved this
posting, I would appreciate a copy.

Thank you ... please email responses ...


Jon Meltzer
Dept. of Modern Languages and Linguistics, Cornell University


*UNIX is a trademark of you-know-who. 

akk2@ur-tut.UUCP (03/27/87)

Two references that you could look at:

1. Welch, Terry A., "A Technique for High-Performance Data
   Compression", IEEE Computer, Vol. 17, No. 6, June 1984, pp. 8-19.

2. Huffman, D., "A Method for the Construction of Minimum-Redundancy
   Codes", Proceedings of the Institute of Radio Engineers,
   Vol. 40, Sept. 1952, pp. 1098-1101.

There are a couple of other good refs. that escape me right now.
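
If it helps, here is a toy version in C of the table-driven scheme
Welch describes.  This is my own sketch, not code from the paper: it
emits fixed 16-bit codes and does a linear dictionary search, so it
is too slow and crude for real use, but it shows the core move (emit
the code for the longest string seen so far, then grow the table by
one entry):

/* Toy LZW: reads stdin, writes one 16-bit code at a time to stdout,
   high byte first.  Codes 0-255 stand for single bytes; each new
   code pairs an existing code with one more input byte.  Note that
   with fixed 16-bit codes, output only shrinks once strings repeat. */
#include <stdio.h>

#define MAXCODES 65536
static int prefix[MAXCODES];   /* code for all but the last byte */
static int suffix[MAXCODES];   /* last byte of the string        */
static int ncodes = 256;       /* codes 0-255 are preloaded      */

static int lookup(int p, int c)     /* code for string p + c, or -1 */
{
    int i;
    for (i = 256; i < ncodes; i++)
        if (prefix[i] == p && suffix[i] == c)
            return i;
    return -1;
}

int main(void)
{
    int c, p, i;

    if ((p = getchar()) == EOF)
        return 0;
    while ((c = getchar()) != EOF) {
        if ((i = lookup(p, c)) >= 0) {  /* known string: extend it  */
            p = i;
            continue;
        }
        putchar(p >> 8);                /* emit code for known part */
        putchar(p & 0xff);
        if (ncodes < MAXCODES) {        /* remember the new string  */
            prefix[ncodes] = p;
            suffix[ncodes] = c;
            ncodes++;
        }
        p = c;                          /* restart at the new byte  */
    }
    putchar(p >> 8);                    /* flush the final code     */
    putchar(p & 0xff);
    return 0;
}

Real implementations (compress on UNIX, or ARC's cruncher) use
variable-width codes and a hashed table, which is where the speed
and the extra compression come from.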


-- 
-----------------------
Atul Kacker
UUCP: ...seismo!rochester!ur-tut!akk2

eppstein@tom.columbia.edu.UUCP (03/27/87)

Another good reference that is perhaps less well known:

Victor S. Miller and Mark N. Wegman, Variations on a theme by Ziv and Lempel,
in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds),
Springer-Verlag, 1985.

References to a number of related works can be found in this paper.
-- 
David Eppstein, eppstein@cs.columbia.edu, Columbia U. Computer Science Dept.

mct@praxis.UUCP (03/30/87)

Once upon a time, about five years ago, a UK company called DJAI (the
authors of The Last One, a 4GL program generator) said that they had an
algorithm which would (reversibly) compress arbitrary text into 40-80 bytes.
They even had a demonstration system which claimed to show the algorithm in
operation.

They planned to demonstrate the system by compressing the complete source
text of The Last One, copying the <80 bytes to a hand-held micro with a
printer, then decompressing and producing the total listing from the micro.

I expect the algorithm has been suppressed by the UK MoD :-)

rolf@warwick.UUCP (03/31/87)

Another reference:

  Shannon, C.E.  "Prediction and Entropy of Printed English",
  Bell Sys. Tech. J.  vol.30, pp.50-64, Jan 1951.
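
As a quick experiment along these lines, the little C program below
estimates the order-0 entropy of whatever you feed it.  It counts
character frequencies only, ignoring all context, so it gives an
upper bound; Shannon's point in the paper is that English, once
context is taken into account, is closer to one bit per character:

/* Order-0 entropy estimate: -sum p*log2(p) over byte frequencies. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    long count[256] = {0}, total = 0;
    double h = 0.0;
    int c;

    while ((c = getchar()) != EOF) {
        count[c]++;
        total++;
    }
    for (c = 0; c < 256; c++)
        if (count[c] > 0) {
            double p = (double)count[c] / total;
            h -= p * (log(p) / log(2.0));
        }
    printf("%.3f bits per character\n", h);
    return 0;
}

Typical English text comes out around 4.5 bits per character this
way, which is why plain character-level Huffman coding gets only
modest savings.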

If you get many other replies perhaps you'd like to post a summary?

-- 
Rolf

Dept. of Computer Science,	Tel:	+44 203 523523 ext 2485
Warwick University,		JANET:	rolf@uk.ac.warwick.flame
Coventry,			UUCP:	{seismo,mcvax}!ukc!warwick!rolf
CV4 7AL
England.			"Three pints? At lunchtime?!"

jsm@vax1.UUCP (04/01/87)

In article <512@ubu.warwick.UUCP> rolf@warwick.UUCP (Rolf Howarth) writes:
>Another reference:
>
>  Shannon, C.E.  "Prediction and Entropy of Printed English",
>  Bell Sys. Tech. J.  vol.30, pp.50-64, Jan 1951.
>
>If you get many other replies perhaps you'd like to post a summary?
>
Stuff is still coming in (thanks, everyone!), and I haven't had time to
check all the references out. Give me about two weeks.

keeshu@nikhefk.UUCP (04/03/87)

In article <327@vax1.ccs.cornell.edu> jsm@vax1.UUCP (Jon Meltzer) writes:
>Stuff is still coming in (thanks, everyone!), and I haven't had time to
>check all the references out. Give me about two weeks.

Another reference:

White, Ronald G., "Compressing Image Data With Quadtrees",
		 Dr. Dobb's Journal, March 1987, pp. 16 ff.
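
The quadtree idea is simple enough to sketch in C.  This is my own
toy version, not White's code: split a square two-colour image into
quadrants, emit a leaf when a block is all one colour, and recurse
otherwise.  It wins when the image has large uniform areas:

/* Toy quadtree encoder for an N x N bilevel image (N a power of 2).
   Emits "L" plus the pixel value for a uniform block, "S" followed
   by the four quadrants (NW, NE, SW, SE) otherwise. */
#include <stdio.h>

#define N 8
static unsigned char img[N][N];        /* test image, all zero     */

static int uniform(int r, int c, int size)
{
    int i, j;
    for (i = r; i < r + size; i++)
        for (j = c; j < c + size; j++)
            if (img[i][j] != img[r][c])
                return 0;
    return 1;
}

static void encode(int r, int c, int size)
{
    int h = size / 2;
    if (size == 1 || uniform(r, c, size)) {
        printf("L%d", img[r][c]);      /* leaf: one uniform block   */
        return;
    }
    printf("S");                       /* split into four quadrants */
    encode(r, c, h);
    encode(r, c + h, h);
    encode(r + h, c, h);
    encode(r + h, c + h, h);
}

int main(void)
{
    img[0][0] = 1;                     /* a single set pixel        */
    encode(0, 0, N);
    printf("\n");
    return 0;
}

The single set pixel makes the 64-pixel test image encode as 13
tokens; a real encoder would pack the leaves into bits rather than
printable characters.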


-- Kees

|  UUCP	  : keeshu@nikhefk.uucp  or : {[wherever]!seismo}!mcvax!nikhefk!keeshu
|  FIDO   : kees huyser at 28/9 or 500/11
|  BITNET : u00212@hasara5.bitnet
|  SNAIL  : kees huyser, NIKHEF-K, PO Box 4395, 1009 AJ Amsterdam, Netherlands

dick@zaphod.UUCP (04/03/87)

In article <512@ubu.warwick.UUCP> rolf@warwick.UUCP (Rolf Howarth) writes:
>Another reference:
>[...]
>If you get many other replies perhaps you'd like to post a summary?

I would very much appreciate a summary being posted.  Thanks very 
much in advance...
-- 
Dick Flanagan, W6OLD
UUCP:  ...!ucbvax!decwrl!sun!plx!dick
GEnie: C.FLANAGAN
(The usual disclaimers apply)

john@viper.UUCP (04/05/87)

In article <383@newton.praxis.co.uk> mct%praxis.uucp@ukc.ac.uk writes:
 >Once upon a time, about five years ago, a UK company called DJAI (the
 >authors of The Last One, a 4GL program generator) said that they had an
 >algorithm which would (reversibly) compress arbitrary text into 40-80 bytes.
 >They even had a demonstration system which claimed to show the algorithm in
 >operation.  
 >
 >They planned to demonstrate the system by compressing the complete source
 >text of The Last One, copying the <80 bytes to a hand-held micro with a
 >printer, then decompressing and producing the total listing from the micro.
 >
 >I expect the algorithm has been suppressed by the UK MoD :-)

  This is a fascinating story...  Anyone else ever hear about this?
  Anyone know -anything- (or have any guesses) about the algorithm?

--- 
John Stanley (john@viper.UUCP)
Software Consultant - DynaSoft Systems
UUCP: ...{amdahl,ihnp4,rutgers}!{meccts,dayton}!viper!john

pitaro@savax.UUCP (04/05/87)

>> Once upon a time, about five years ago, a UK company called DJAI (the
>> authors of The Last One, a 4GL program generator) said that they had an
>> algorithm which would (reversibly) compress arbitrary text into 40-80 bytes.
>> They even had a demonstration system which claimed to show the algorithm in
>> operation.  
>>
>> They planned to demonstrate the system by compressing the complete source
>> text of The Last One, copying the <80 bytes to a hand-held micro with a
>> printer, then decompressing and producing the total listing from the micro.
>>
>> I expect the algorithm has been suppressed by the UK MoD :-)

 > This is a fascinating story...  Anyone else ever hear about this?
 > Anyone know -anything- (or have any guesses) about the algorithm?

My guess is that it only handles certain texts, i.e., those in its
list.  If the text is indexed in the list, it prints it out;
otherwise it stops with an error.  The "arbitrary text" this algorithm
compresses is arbitrary as long as it's in the database.  Or maybe
it's a hoax.  What was the date on the original article? 4-1-87 maybe :-)

	Michael Pitaro

USMail:	Sanders Associates		UUCP:	decvax!savax!pitaro
	MER24-1583C			PHONE:	+1 (603) 885-9036
	CS 2034				HOME:	46-D Hampshire Drive
	Nashua, NH 03063-2034			Nashua, NH 03063

lambert@mcvax.UUCP (04/06/87)

]>> Once upon a time, about five years ago, a UK company called DJAI (the
]>> authors of The Last One, a 4GL program generator) said that they had an
]>> algorithm which would (reversibly) compress arbitrary text into 40-80 bytes.

]> This is a fascinating story...  Anyone else ever hear about this?

] Or maybe it's a hoax.  What was the date on the original article? 4-1-87
] maybe :-)

No, Mar-30-87.  But think: when was "about five years ago"?

-- 

Lambert Meertens, CWI, Amsterdam; lambert@cwi.nl

lindsay@cheviot.UUCP (04/07/87)

In article <528@savax.UUCP> pitaro@savax.UUCP (Michael Pitaro) writes:
>compresses is arbitrary as long as it's in the database.  Or maybe
>it's a hoax.  What was the date on the original article? 4-1-87 maybe :-)

As far as I remember it wasn't a hoax but a con, and I don't think it was
to do with The Last One people either. I recollect that someone was
touring the country charging 500 pounds for a seminar on this technique.
Several people paid him "just in case". The man I am thinking of
subsequently vanished with a large sum of money...


Lindsay

-- 
Lindsay F. Marshall, Computing Lab., U of Newcastle upon Tyne, Tyne & Wear, UK
JANET: lindsay@uk.ac.newcastle.cheviot  ARPA: lindsay%cheviot.newcastle@ucl-cs
PHONE: +44-91-2329233                   UUCP: <UK>!ukc!cheviot!lindsay
"How can you be in two places at once when you're not anywhere at all?"

hollombe@ttidca.UUCP (04/10/87)

>> Once upon a time, about five years ago, a UK company called DJAI (the
>> authors of The Last One, a 4GL program generator) said that they had an
>> algorithm which would (reversibly) compress arbitrary text into 40-80 bytes.
>> They even had a demonstration system which claimed to show the algorithm in
>> operation.
>>
>> They planned to demonstrate the system by compressing the complete source
>> text of The Last One, copying the <80 bytes to a hand-held micro with a
>> printer, then decompressing and producing the total listing from the micro.

> Anyone know -anything- (or have any guesses) about the algorithm?

There are algorithms that can do this, though whether they could be
implemented on a micro is problematical.  The one that comes to mind is
converting the arbitrary text into one _very_ large number and then
representing that number as a combination of powers of smaller numbers.

E.g.:  1234^5678 + 456^123 + 90^20 - 8^2 - 3

Note that the number above probably couldn't be written out in full
within the known universe, but the representation above is highly
compact.  Sorry, I'm not aware of the details of how the text
conversion and restoration is done.  I think Kurt Goedel (as in
_Goedel, Escher, Bach: An Eternal Golden Braid_) first proposed the
technique (occasionally referred to as "Goedelizing").
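
(A caveat: the Goedel numbering in the logic books encodes a sequence
of character codes c1, c2, ..., cn as 2^c1 * 3^c2 * 5^c3 * ..., one
prime per position.  Even the two-byte text "AB", with codes 65 and
66, becomes 2^65 * 3^66, a number of about 170 bits, so the straight
encoding expands text rather than compressing it.  All the hoped-for
compactness would have to come from finding a short formula for the
resulting number, and for most numbers no short formula exists.)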

-- 
The Polymath (aka: Jerry Hollombe, hollombe@TTI.COM)
Citicorp(+)TTI
3100 Ocean Park Blvd.   (213) 450-9111, x2483
Santa Monica, CA  90405 {csun|philabs|psivax|trwrb}!ttidca!hollombe

pdg@ihdev.UUCP (04/13/87)

In article <4542@columbia.UUCP> metzger@garfield.columbia.edu.UUCP (Perry Metzger) writes:
>Look folks, as we all know from being cryptographers (you do actually
>know a bit about the subject, don't you?) that the redundancy of
>English is NEVER estimated by anyone in his right mind at more than
>75%, and thus it is impossible to compress it beyond that in general.

Unless you take symbolic representation into account.  Say you
numbered the (2^14)-1 words, and used that to replace just the
absolute matches (forgetting about endings etc), you would already be
eliminating more than you claim, before even implementing redundancy
techniques.  If you read a little more on the subject, you would
realize there is more than one way to skin a cat.
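
A sketch of what I mean, in C.  The four-entry word list is a
stand-in for a real frequency-sorted dictionary of (2^14)-1 words,
and I assume plain 7-bit ASCII input so that a set top bit can mark
a code:

/* Word-table coder: a word that matches the table exactly becomes
   a two-byte code (top bit set on both bytes, 7 index bits each);
   everything else is copied through literally. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define NWORDS 4               /* stand-in for (2^14)-1 entries */
static const char *words[NWORDS] = { "the", "and", "that", "with" };

static void put_word(const char *buf)
{
    int i;
    for (i = 0; i < NWORDS; i++)
        if (strcmp(words[i], buf) == 0) {
            putchar(0x80 | (i >> 7));      /* high 7 bits of index */
            putchar(0x80 | (i & 0x7f));    /* low 7 bits of index  */
            return;
        }
    fputs(buf, stdout);                    /* no match: literal    */
}

int main(void)
{
    char buf[64];
    int c, n = 0;

    while ((c = getchar()) != EOF) {
        if (isalpha(c) && n < 63) {
            buf[n++] = c;                  /* accumulate a word    */
        } else {
            buf[n] = '\0';
            if (n > 0)
                put_word(buf);
            n = 0;
            putchar(c);                    /* spaces, punctuation  */
        }
    }
    buf[n] = '\0';
    if (n > 0)                             /* flush a final word   */
        put_word(buf);
    return 0;
}

Each code costs two bytes, so only matches of three or more letters
save anything, and capitalized and inflected forms need their own
handling -- but the lookup is cheap enough to be transparent.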

-- 

Paul Guthrie
ihnp4!ihdev!pdg			This Brain left intentionally blank.

metzger@garfield.columbia.edu.UUCP (04/16/87)

In article <1323@ihdev.ATT.COM> pdg@ihdev.UUCP (Joe Isuzu) writes:
>Unless you take symbolic representation into account.  Say you
>numbered the (2^14)-1 words, and used that to replace just the
>absolute matches (forgetting about endings etc), you would already be
>eliminating more than you claim, before even implementing redundancy
>techniques.  If you read a little more on the subject, you would
>realize there is more than one way to skin a cat.

Sorry, you still lose. If you do that, you are not compressing an
ARBITRARY text that much. For instance, if I have 256 books in my
library, and each one of them contains, say, a million words (I know
that is big), you can then compress any book in the library down to 8
bits, provided that you know that the text you are transmitting is in
the library. But then, you see, you are not compressing an ARBITRARY
text, only a text you can find in that limited library. If I pick
completely random text I can't do that.
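
To put numbers on it: there are 256^1000 distinct 1000-byte texts,
but at most 256^0 + 256^1 + ... + 256^80 < 256^81 distinct outputs
of 80 bytes or fewer.  So any scheme that maps every 1000-byte text
into at most 80 bytes must send two different texts to the same
output, and at least one of them can never be recovered.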

Enough of this.

.pm

rab@well.UUCP (04/22/87)

In a previous article Perry Metzger writes:
+  Joe Isuzu writes:
+>Unless you take symbolic representation into account.  Say you
+>numbered the (2^14)-1 words, and used that to replace just the
+>absolute matches (forgetting about endings etc), you would already be
+>eliminating more than you claim, before even implementing redundancy
+>techniques.  If you read a little more on the subject, you would
+>realize there is more than one way to skin a cat.

+Sorry, you still lose. If you do that, you are not compressing an
+ARBITRARY text that much. For instance, if I have 256 books in my
+library, and each one of them contains, say, a million words (I know
+that is big), you can then compress any book in the library down to 8
+bits, provided that you know that the text you are transmitting is in
+the library. But then, you see, you are not compressing an ARBITRARY
+text, only a text you can find in that limited library. If I pick
+completely random text I can't do that.

  The subject was English text.  Last time I checked, English was not
completely arbitrary.  Illogical, perhaps, but not *completely* arbitrary.
The above-described technique would work fairly well on most text,
although I'm not sure whether it would allow you to break the 75% barrier
you previously referred to.  Maybe.


-- 
Robert Bickford         {hplabs, ucbvax, lll-lcc, ptsfa}!well!rab
terrorist cryptography DES drugs cipher secret decode NSA CIA NRO IRS
coke crack pot LSD russian missile atom nuclear assassinate libyan RSA

mouse@mcgill-vision.UUCP (der Mouse) (04/26/87)

In article <1323@ihdev.ATT.COM>, pdg@ihdev.ATT.COM (Joe Isuzu) writes:
> In article <4542@columbia.UUCP> metzger@garfield.columbia.edu.UUCP (Perry Metzger) writes:
>> Look folks, as we all know from being cryptographers [...] that the
>> redundancy of English is NEVER estimated [...] at more than 75%, and
>> thus it is impossible to compress it beyond that in general.

> Unless you take symbolic representation into account.  Say you
> numbered the (2^14)-1 words, and used that to replace just the
> absolute matches (forgetting about endings etc),

What you appear to be suggesting is not just compressing English text
but representing it at a higher level.  If you "compress" beyond the
point of discarding all redundancy, you are losing information; in other
words, there will be distinct input texts which compress to the same
compressed text, which means it will be impossible to correctly
regenerate both of them from the compressed text.  For example, in
English there are often two (or more) ways to represent the same
concept.  We are not interested in merely storing ideas (or if we are,
we should be explicit in admitting it), we want to compress text and
then uncompress to the exact same sequence of bytes.

Alternatively, your posting could be taken to mean that you believe
there is more than 75% redundancy in English.  In the case of Usenet
postings I fear you are correct - maybe your software would improve
netnews performance! -- what's that?  You don't have this implemented?
Then how do you know it works?  Go ahead and try it; I suspect you will
find it doesn't compress by as much as you think it will.

					der Mouse

				(mouse@mcgill-vision.uucp)