[comp.sys.ibm.pc] Wanted: fast text compression source

alcmist@well.UUCP (Frederick Wamsley) (09/08/89)

I'm looking for text compression code which will be used to compress blocks
of text 1K-30K in size.  It should be able to compress/decompress a 30K
block in somewhere around 10 seconds on an AT-class machine.  The text will
be source for a programming language, so there will be a lot of common 
strings to take advantage of.

Compression should be better than Huffman-coding since that's what we're 
using now.

Naturally it will have to allow for 64K segments.

All pointers to something appropriate will be gratefully received ...
-- 
Fred Wamsley  {ucbvax,pacbell,apple,hplabs}!well!alcmist;
CIS 72247,3130; GEnie FKWAMSLEY; USPS - why bother?
Have you hugged your iguana today?

rmyers@net1.ucsd.edu (Robert Myers) (09/09/89)

In article <13511@well.UUCP> alcmist@well.UUCP (Frederick Wamsley) writes:
>I'm looking for text compression code which will be used to compress blocks
> ...


Contact Bookmaster Corp. in Telluride, CO.  They have text
compression utilities that were used in the past for legal software.
As I recall, they have a generic cruncher that will compress, index,
and create a dictionary on text files with the performance you
require (30K in 10sec on AT).

I'm not sure of the price, but here's the address:

	Bookmaster Corp.
	Box 2396
	Telluride, CO 81435
	(303) 728-6412


T.K. Plummer

dmt@mtunb.ATT.COM (Dave Tutelman) (09/11/89)

In article <1958@network.ucsd.edu> rmyers@net1.UUCP (Robert Myers) writes:
>In article <13511@well.UUCP> alcmist@well.UUCP (Frederick Wamsley) writes:
>>I'm looking for text compression code which will be used to compress blocks
>> ...
>Contact Bookmaster Corp. in Telluride, CO...
>As I recall, they have a generic cruncher that will compress, index,
>and create a dictionary on text files with the performance you
>require (30K in 10sec on AT).

I may be missing something, but what's the difference between this
and the "archivers" that we all use like PKZIP, zoo, ARC (careful..)
etc?
   -	Function (sounds very similar)?
   -	Speed?
   -	Compression ratio?

I missed (and couldn't find) the base note, so maybe I've missed the key
that would enlighten me.  However, the Bookmaster reference doesn't
have anything in it that gives me a clue to the difference.

+---------------------------------------------------------------+
|    Dave Tutelman						|
|    Physical - AT&T Bell Labs  -  Middletown, NJ		|
|    Logical -  ...att!mtunb!dmt				|
|    Audible -  (201) 957 6583					|
+---------------------------------------------------------------+

rmyers@net1.ucsd.edu (Robert Myers) (09/11/89)

In article <1656@mtunb.ATT.COM> dmt@mtunb.UUCP (Dave Tutelman) writes:
>In article <1958@network.ucsd.edu> rmyers@net1.UUCP (Robert Myers) writes:
>>In article <13511@well.UUCP> alcmist@well.UUCP (Frederick Wamsley) writes:
>>>I'm looking for text compression code which will be used to compress blocks
>>> ...
>>Contact Bookmaster Corp. in Telluride, CO...
>>As I recall, they have a generic cruncher that will compress, index,
>>and create a dictionary on text files with the performance you
>>require (30K in 10sec on AT).
>
>I may be missing something, but what's the difference between this
>and the "archivers" that we all use like PKZIP, zoo, ARC (careful..)
>etc?
>   -	Function (sounds very similar)?
>   -	Speed?
>   -	Compression ratio?
>

Sorry for the lack of more specific information on the original
posting.  Essentially, Bookmaster's program (not sure of the name)
will take a text file and *compress* it.  This process involves
creating two files: one consists of a word dictionary and index, and
the other is the original text coded into one- and two-byte tokens
that correspond to the dictionary.  This process takes about 10 sec
or so to compress a small (30K) text file on an AT.
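
I don't have their source, so take the following with a grain of salt:
it is only my own sketch of that *kind* of scheme, and the file names,
token layout, and dictionary format below are all made up, not
Bookmaster's.

/*
 * crunch.c -- toy word-tokenizing compressor.  My own guess at the
 * general approach, NOT Bookmaster's actual format.  Splits the input
 * into alternating runs of word and non-word characters, gives each
 * distinct run a dictionary index, and writes two files:
 *   text.dic  --  <length byte><bytes> for each dictionary entry
 *   text.tok  --  a stream of 1-byte (index 0-127) and 2-byte
 *                 (index 128-32895) tokens
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAXDICT 32896              /* 128 one-byte + 32768 two-byte codes */

static char *dict[MAXDICT];
static int ndict = 0;

static int lookup(const char *w)   /* linear search; a real program would */
{                                  /* hash, but this is only a sketch     */
    int i;
    for (i = 0; i < ndict; i++)
        if (strcmp(dict[i], w) == 0)
            return i;
    if (ndict >= MAXDICT) {
        fprintf(stderr, "dictionary full\n");
        exit(1);
    }
    dict[ndict] = malloc(strlen(w) + 1);
    strcpy(dict[ndict], w);
    return ndict++;
}

static void emit(int idx, FILE *tok)
{
    if (idx < 128)                           /* first 128 entries: 1 byte */
        putc(idx, tok);
    else {                                   /* everything else: 2 bytes  */
        putc(0x80 | ((idx - 128) >> 8), tok);
        putc((idx - 128) & 0xFF, tok);
    }
}

int main(void)
{
    FILE *tok = fopen("text.tok", "wb");
    FILE *dic = fopen("text.dic", "wb");
    char run[256];
    int c = getc(stdin), n, i, alpha;

    if (tok == NULL || dic == NULL) { perror("fopen"); return 1; }
    while (c != EOF) {
        alpha = isalnum(c) ? 1 : 0;
        n = 0;
        do {                                 /* gather one run of word or */
            run[n++] = (char)c;              /* non-word characters       */
            c = getc(stdin);
        } while (c != EOF && n < 255 && (isalnum(c) ? 1 : 0) == alpha);
        run[n] = '\0';
        emit(lookup(run), tok);
    }
    for (i = 0; i < ndict; i++) {            /* dictionary: length-       */
        putc((int)strlen(dict[i]), dic);     /* prefixed entries          */
        fputs(dict[i], dic);
    }
    fclose(tok);
    fclose(dic);
    return 0;
}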

	FUNCTION:  The function of this program is NOT for archival
purposes.  Consider that it makes two files out of one, not one file
out of many.  Its function is for 1) text compression, 2) ultra fast
decompression of text, and 3) high speed searching.  As I said
before, this was part of a program for searching legal depositions.
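
To show why the "high speed searching" part falls out almost for free:
the query word is looked up in the dictionary once to get its token,
and then you scan the token stream for that code instead of
string-matching the raw text.  Again, this is only my guess at the
mechanism, using the same made-up file layout as the sketch above.

/* findword.c -- toy search over the token file from the sketch above.
 * Look the query word up in the dictionary ONCE, then scan the token
 * stream for its code; no character-by-character string matching.
 * My own guess at the technique, not Bookmaster's code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXDICT 32896

static char *dict[MAXDICT];

int main(int argc, char **argv)
{
    FILE *dic = fopen("text.dic", "rb");
    FILE *tok = fopen("text.tok", "rb");
    int ndict = 0, len, i, c, target = -1;
    long hits = 0;

    if (argc != 2 || dic == NULL || tok == NULL) {
        fprintf(stderr, "usage: findword <word>\n");
        return 1;
    }
    while ((len = getc(dic)) != EOF && ndict < MAXDICT) {  /* load dict */
        dict[ndict] = malloc(len + 1);
        for (i = 0; i < len; i++)
            dict[ndict][i] = (char)getc(dic);
        dict[ndict][len] = '\0';
        if (strcmp(dict[ndict], argv[1]) == 0)
            target = ndict;
        ndict++;
    }
    if (target < 0) {                       /* not in the dictionary means */
        printf("0 occurrences\n");          /* not in the text at all      */
        return 0;
    }
    while ((c = getc(tok)) != EOF) {        /* scan tokens, not characters */
        int idx = (c < 128) ? c : 128 + ((c & 0x7F) << 8) + getc(tok);
        if (idx == target)
            hits++;
    }
    printf("%ld occurrence(s) of \"%s\"\n", hits, argv[1]);
    return 0;
}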

	SPEED:  The initial compression is what takes the longest.
To decompress the text, the dictionary file must be loaded into RAM
and the tokens are decoded on the fly.  This allows you to decode
the words as they are read from the text token file.  I'm not sure
of any exact specifications on decompression speed, but I believe
that it is in the area of a few thousand words per second.
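
And the matching decode step for that toy format, to show why
decompression is little more than table lookups once the dictionary is
sitting in RAM (again my own sketch, not their code):

/* uncrunch.c -- companion to the crunch.c sketch: load the dictionary
 * into RAM, then turn the 1- and 2-byte tokens back into text on the
 * fly.  Same invented file layout, not Bookmaster's. */
#include <stdio.h>
#include <stdlib.h>

#define MAXDICT 32896

static char *dict[MAXDICT];
static int ndict = 0;

int main(void)
{
    FILE *dic = fopen("text.dic", "rb");
    FILE *tok = fopen("text.tok", "rb");
    int c, len, i;

    if (dic == NULL || tok == NULL) { perror("fopen"); return 1; }
    while ((len = getc(dic)) != EOF && ndict < MAXDICT) {  /* dict -> RAM */
        dict[ndict] = malloc(len + 1);
        for (i = 0; i < len; i++)
            dict[ndict][i] = (char)getc(dic);
        dict[ndict][len] = '\0';
        ndict++;
    }
    while ((c = getc(tok)) != EOF) {         /* decode tokens on the fly */
        int idx = (c < 128) ? c : 128 + ((c & 0x7F) << 8) + getc(tok);
        if (idx < ndict)
            fputs(dict[idx], stdout);        /* emit the original run    */
    }
    return 0;
}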

	COMPRESSION RATIO:  As always, this varies with the file.
Compression ratios range from 50% (worst) to 20% (best) of the size of
the original file.  Normally (for a 50-page deposition!) the
compressed text AND dictionary (together, not separately) are one
third (1/3) the size of the original file.  I know they used this
method to compress the Bible.  The compressed text plus the
dictionary comes to approximately 1.2 megs: compressed text = 1.03
megs, dictionary = 190K.  I believe the uncompressed text is around
4 to 6 megabytes.  This is considerably better than any of the
archivers I've seen, plus it can be decompressed very quickly.

Hope this has been of help.

T.K. Plummer