[comp.binaries.ibm.pc.d] Is uncompression faster than disk I/O?

nr@notecnirp.Princeton.EDU (Norman Ramsey) (01/17/89)

Someone suggested to me that it might pay off to store my data files
in compressed format, then uncompress them when I get ready to use
them.  The claim was that uncompression is faster than the associated
disk I/O.  So here's the $64 question: has anybody substantiated this
claim for the IBM PC, XT, AT, or PS/2?  (Remember that computation and
I/O speeds differ on these machines, so your mileage may vary.)

In particular, how hard would it be to adapt the zoo source to do the
following: 
	f = zopen (file, pathname, "r")
		open pathname in the zoo archive file for read only
	zread(f,...)
		read from the zoo archive
	zgets(f,...)
		like fgets only compressed
	zgetc(f,...)
		like fgetc only compressed
	zeof(f,...)
		you get the idea...

If I had this facility, I might actually be able to speed up my
programs and have them use less disk space at the same time (provided
the claim is true).  So, has anybody built a library like this?
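
To make that concrete, here is roughly what I picture -- a sketch only,
not working code.  The ZFILE layout is my invention, and zopen() and
zread() are left as the two routines that would actually have to be
grafted onto the zoo extraction code; everything else is just buffering.

    /* zio.c -- sketch of a compressed-stream interface over a zoo archive.
     * The ZFILE layout is invented for illustration; nothing here comes
     * from the real zoo source. */

    #include <stdio.h>      /* for EOF */

    #define ZBUFSIZ 2048

    typedef struct {
        void *decoder;              /* opaque handle into the zoo decompressor */
        unsigned char buf[ZBUFSIZ]; /* decompressed bytes ready to hand out    */
        int pos, len;               /* read cursor and fill level of buf       */
        int at_eof;                 /* nonzero once the member is exhausted    */
    } ZFILE;

    /* Only these two need to understand the archive format: */
    ZFILE *zopen(const char *archive, const char *member, const char *mode);
    int    zread(ZFILE *f, void *out, int want);   /* bytes read, 0 at end */

    /* Everything else is just buffering layered on zread(): */
    int zeof(ZFILE *f)
    {
        return f->at_eof;
    }

    int zgetc(ZFILE *f)
    {
        if (f->pos >= f->len) {                    /* refill the buffer */
            f->len = zread(f, f->buf, ZBUFSIZ);
            f->pos = 0;
            if (f->len <= 0) {
                f->at_eof = 1;
                return EOF;
            }
        }
        return f->buf[f->pos++];
    }

    char *zgets(ZFILE *f, char *s, int n)
    {
        int i = 0, c;

        while (i < n - 1 && (c = zgetc(f)) != EOF) {
            s[i++] = (char)c;
            if (c == '\n')
                break;
        }
        s[i] = '\0';
        return (i == 0) ? NULL : s;                /* NULL at end, like fgets */
    }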

My immediate desire is for a fast Boyer-Moore grep on files in a zoo
archive.   Does anybody have that?

Does anybody who has messed with the zoo source (or Rahul of course)
have any idea how hard it would be to twiddle zoo to do something like
this?  Obviously the space overhead would be low, since sez is just a
few Kbytes extra...

Norman Ramsey
nr@princeton.edu

dmurdoch@watdcsu.waterloo.edu (D.J. Murdoch - Statistics) (01/18/89)

In article <14227@princeton.Princeton.EDU> nr@princeton.Princeton.EDU (Norman Ramsey) writes:
>
>Someone suggested to me that it might pay off to store my data files
>in compressed format, then uncompress them when I get ready to use
>them.  The claim was that uncompression is faster than the associated
>disk I/O.  So here's the $64 question: has anybody substantiated this
>claim for the IBM PC, XT, AT, or PS/2?  (Remember that computation and
>I/O speeds differ on these machines, so your mileage may vary.)

I've found this to be true in one very special case:  searching a lot of
tiny files for text.  I've got about 200 files containing 1-2K each, and
found that Borland's GREP on all the little files takes about twice as long 
as Buerg's ARCF on a big ARC file.  I'm using a 16 MHz 386 clone with a 28 ms
disk.

Duncan Murdoch

leonard@bucket.UUCP (Leonard Erickson) (01/18/89)

Sounds like what you want is a copy of looz.  One of looz's options is 
"extract and execute". looz xx {zoo name} {filename} will extract the
file to *memory* and run it. 
Why reinvent the wheel?
-- 
Leonard Erickson		...!tektronix!reed!percival!bucket!leonard
CIS: [70465,203]
"I used to be a hacker. Now I'm a 'microcomputer specialist'.
You know... I'd rather be a hacker."

ttp@beta.lanl.gov (T T Phillips) (01/20/89)

In article <14227@princeton.Princeton.EDU>, nr@notecnirp.Princeton.EDU (Norman Ramsey) writes:
> 
> Someone suggested to me that it might pay off to store my data files
> in compressed format, then uncompress them when I get ready to use
> them.  ... has anybody substantiated this
> claim for the IBM PC, XT, AT, or PS/2?

I have found that sometimes this is true, other times not.  In
working with Turbo C on a Kaypro 2000 portable with one of the
slowest 3.5" floppy drives ever created, I found that it was
several times faster to ARChive the .H files, and then extract
them to a RAM disk for use than it was to copy the individual
files over to the RAM disk.  This was a case with a lot of
little files, where the time spent accessing the directory
between reads really slowed down the file-by-file copy.  On the
same machine working with some of the large files like the
libraries for Turbo C, the ARC overhead slowed things down.  I
never tried options like ARCing the files in an uncompressed
format.  Incidentally, I was using a pretty old and now probably
illegal version of pkxarc.  

nanook@novavax.UUCP (Keith Dickinson) (01/23/89)

in article <14227@princeton.Princeton.EDU>, nr@notecnirp.Princeton.EDU (Norman Ramsey) says:
> Someone suggested to me that it might pay off to store my data files
> in compressed format, then uncompress them when I get ready to use
> them.  The claim was that uncompression is faster than the associated
> disk I/O.  So here's the $64 question: has anybody substantiated this
> claim for the IBM PC, XT, AT, or PS/2?  (Remember that computation and
> I/O speeds differ on these machines, so your mileage may vary.)
> 
> Norman Ramsey
> nr@princeton.edu

Norman: if you were running your software off of a floppy, I'd say it's
possible.  But even then I'd suggest that you find some way to obtain the
quick compress routines from either PKPAK or SEA (ARC).

A long time ago, someone came out with a program that would compress the
spreadsheet files for Lotus. This worked just fine for saving space, but took
forever and a day to save.

Your inherent problem is that data compression takes up more CPU cycles than
writing the file probably ever could.  If you're writing to a hard disk, I'd
say there is NO way you could compress faster than you could write.

Now, with the new high-speed compression programs, it is possible to do
minimal compression (i.e. packing) quickly enough that the slowdown wouldn't
be noticeable, but it would still be a little slower.

Keith

_   /|  | Fidonet  : 369/2 [(305) 421-8593] Brave Mew World South
\'o.O'  | Internet : nanook@muadib.FIDONET.ORG
=(___)= | UUCP     : (novavax,hoptoad!ankh,muadib)!nanook
   U    | USNail   : 433 SE 13th CT. J-202, Deerfield Beach, Fl. 33441
  Ack!  | Disclamer: This message was created by a faulty AI program.
Don't blame me...I voted for Bill'n'Opus in '88

simon@ms.uky.edu (Simon Gales) (01/23/89)

LOOZ (part of the zoo set of programs) allows you to execute programs
that are stored in a ZOO archive.  It doesn't help you access archived
data files, but if you are using floppies, you could put most of the 
DOS stuff and your utilities into an archive.

-- 
/--------------------------------------------------------------------------\
  Simon Gales@University of Ky         UUCP:   {rutgers, uunet}!ukma!simon 
  (- Insert Virus Code Here -)         Arpa:   simon@ms.uky.edu 
  MaBell: (606) 263-2285/257-3597      BitNet: simon@UKMA.BITNET  

john@stiatl.UUCP (John DeArmond) (01/23/89)

In article <929@novavax.UUCP> nanook@novavax.UUCP (Keith Dickinson) writes:
>in article <14227@princeton.Princeton.EDU>, nr@notecnirp.Princeton.EDU (Norman Ramsey) says:
>> Someone suggested to me that it might pay off to store my data files
>> in compressed format, then uncompress them when I get ready to use
>> them.  
>> 
>> Norman Ramsey
>> nr@princeton.edu
>
>Norman: if you were running your software off of a floppy, I'd say it's
>possible.  But even then I'd suggest that you find some way to obtain the
>quick compress routines from either PKPAK or SEA (ARC).
>
>Your inherent problem is that data compression takes up more CPU cycles than
>writing the file probably ever could.  If you're writing to a hard disk, I'd
>say there is NO way you could compress faster than you could write.
>
It really does not matter that data compression takes more machine 
cycles than writing does.  What does matter is whether or not a given
block of data can be compressed during the interval a program would
otherwise spend waiting for disk I/O.

A concrete example:

About a year ago a friend and I wrote a high-resolution Mandelbrot map
generator.  This program calculates maps to 8-bit resolution and thus
stores a single pixel per byte.  An EGA-resolution map occupies about
220K bytes.  Aside from the obvious hassles and waste in storing
and transmitting such maps, the load time into the display program
is significant.

I implemented a simple RLE (run-length) compression.  Depending on the
complexity of the map, the data file shrinks by anywhere from 50% to a
factor of 5 or more.
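
For the curious, the scheme is nothing fancier than run-length pairs.
Here is a sketch of the idea -- not my actual code; the (count, value)
pair format and the 255-byte run cap are just one common way to do it:

    #include <stdio.h>

    /* Pack: emit (run length, pixel value) pairs, runs capped at 255. */
    long rle_pack(FILE *in, FILE *out)
    {
        int c, prev = EOF, run = 0;
        long written = 0;

        while ((c = getc(in)) != EOF) {
            if (c == prev && run < 255) {
                run++;
            } else {
                if (run > 0) {            /* flush the previous run */
                    putc(run, out);
                    putc(prev, out);
                    written += 2;
                }
                prev = c;
                run = 1;
            }
        }
        if (run > 0) {                    /* flush the final run */
            putc(run, out);
            putc(prev, out);
            written += 2;
        }
        return written;                   /* packed size in bytes */
    }

    /* Unpack: expand each (run length, pixel value) pair. */
    void rle_unpack(FILE *in, FILE *out)
    {
        int run, c;

        while ((run = getc(in)) != EOF && (c = getc(in)) != EOF)
            while (run-- > 0)
                putc(c, out);
    }

On a map with no repeated pixels at all, pairs like this would double
the file, so how well it pays off really does depend on how smooth the
image is.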

I've found that on my Compaq 386, it is MUCH faster to decompress
on the fly than to read the raw data.  The CPU is so much faster than
the I/O system that there is really no contest.  It's interesting to 
note that even with the image file totally in disk cache space,
the uncompress program runs faster than the program that uses raw
pixel maps.  The difference is not large, but it is nonetheless significant.


So Norman, the answer to your question is: it depends!  The
general-purpose algorithms such as LZW (ARC, compress, etc.) and
Huffman do a pretty good job in the general sense, but you can do
much better if you can exploit some characteristic of your data
set.  KISS principles apply fully here.  Many times a very
simple algorithm will achieve a high fraction of what more
complicated routines achieve, but with a vastly smaller
implementation and execution time.  So, take your compiler in one
hand and an editor in the other and EXPERIMENT!
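
If you want a starting point, a bare-bones harness along these lines
will settle it for your own data.  This is a sketch only: it assumes the
rle_unpack() routine from the sketch above is linked in, and clock() on
a PC is coarse, so use a big file or average several runs.

    #include <stdio.h>
    #include <time.h>

    extern void rle_unpack(FILE *in, FILE *out);  /* from the RLE sketch */

    static double elapsed(clock_t t0)
    {
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(int argc, char **argv)
    {
        static char buf[8192];
        FILE *fp, *nul;
        clock_t t0;

        if (argc != 3) {
            fprintf(stderr, "usage: %s rawfile packedfile\n", argv[0]);
            return 1;
        }

        /* Pass 1: just drag the raw pixel file through the I/O system. */
        fp = fopen(argv[1], "rb");
        if (fp == NULL) { perror(argv[1]); return 1; }
        t0 = clock();
        while (fread(buf, 1, sizeof buf, fp) > 0)
            ;                             /* discard; we only want the read time */
        printf("raw read:        %.2f s\n", elapsed(t0));
        fclose(fp);

        /* Pass 2: read the packed file and expand it to the bit bucket. */
        fp  = fopen(argv[2], "rb");
        nul = fopen("NUL", "wb");         /* DOS's bit bucket */
        if (fp == NULL || nul == NULL) { perror("open"); return 1; }
        t0 = clock();
        rle_unpack(fp, nul);
        printf("packed + unpack: %.2f s\n", elapsed(t0));
        fclose(fp);
        fclose(nul);
        return 0;
    }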

John

-- 
John De Armond, WD4OQC                     | "I can't drive 85!"
Sales Technologies, Inc.    Atlanta, GA    | Sammy Hagar driving 
...!gatech!stiatl!john                     | thru Atlanta!  

wcs@skep2.ATT.COM (Bill.Stewart.[ho95c]) (01/24/89)

In article <2863@stiatl.UUCP> john@stiatl.UUCP (John DeArmond) writes:
} >> Someone suggested to me that it might pay off to store my data files
} >> in compressed format, then uncompress them when I get ready to use
} I've found that on my Compaq 386, it is MUCH faster to decompress
} on the fly than to read the raw data.  The CPU is so much faster than
} the I/O system that there is really no contest.  ........
} So Norman, the answer to your question is: it depends!  The
	That's really the correct answer.  With the more
	sophisticated algorithms, compression is often slow,
	but decompression is table-driven and very fast.  If
	you are going to generate a data set that you read a
	lot but seldom write to, and don't need to access
	randomly from the disk, storing it compressed can be
	a big win (assuming you've got enough memory around).
	If you're going to do a lot of both, try LZW, but
	also try simple run-length encoding (RLE) for images.

} general-purpose algorithms such as LZW (ARC, compress, etc.) and
} Huffman do a pretty good job in the general sense, but you can do
} much better if you can exploit some characteristic of your data
} set.  KISS principles apply fully here.  Many times a very
} simple algorithm will achieve a high fraction of what more
} complicated routines achieve, but with a vastly smaller
} implementation and execution time.  So, take your compiler in one
} hand and an editor in the other and EXPERIMENT!

	I've seldom found anything to beat LZW, even when I
	know a *lot* about the special characteristics of
	the data.  For some data (e.g. images with lots of 
	empty space) a simple RLE can be worthwhile;
	a small, fast 90% solution may be better than a big, slow
	98% one, especially on a PC.  But document your
	algorithms *very thoroughly*, and don't waste too
	much time tweaking.
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218 Holmdel NJ 201-949-0705 ho95c.att.com!wcs
#
#	News.  Don't talk to me about News.