[comp.binaries.ibm.pc.d] Today's lzhuf tests

davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) (04/13/89)

  I have often noted the poor performance of compress on uuencoded
files when sending news. I therefore took an arc file, converted it to
text via uuencode, and applied compress and lzhuf to it. I repeated the
test using the more efficient btoa routine.


  Size  Modify time      File name
  7028   6 Apr 89 13:38  lzhsrc10.arc	The original archive file

  9715  13 Apr 89 11:43  arcuu		arc->uuencode
  7502  13 Apr 89 11:44  arcuu.L	arc->uuencode->lzhuf
  9523  13 Apr 89 11:44  arcuu.Z	arc->uuencode->compress

  8956  13 Apr 89 11:45  arcb2a		arc->btoa
  7472  13 Apr 89 11:46  arcb2a.L	arc->btoa->lzhuf
 10007  13 Apr 89 11:46  arcb2a.Z	arc->btoa->compress
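
The sizes above are easier to compare as ratios. A short Python sketch
of the arithmetic follows; the figures are copied from the listing, and
the helper name `ratio` is made up for illustration.

```python
# Sizes in bytes, copied from the listing above.
sizes = {
    "lzhsrc10.arc": 7028,
    "arcuu": 9715, "arcuu.L": 7502, "arcuu.Z": 9523,
    "arcb2a": 8956, "arcb2a.L": 7472, "arcb2a.Z": 10007,
}

def ratio(name, base="lzhsrc10.arc"):
    """Size of `name` relative to `base` (hypothetical helper)."""
    return sizes[name] / sizes[base]

# Every file relative to the original archive.
for name, size in sizes.items():
    print(f"{name:14s} {size:6d}  {ratio(name):.3f}")

# Each compressed file relative to the encoded text it was made from.
for packed, encoded in [("arcuu.L", "arcuu"), ("arcuu.Z", "arcuu"),
                        ("arcb2a.L", "arcb2a"), ("arcb2a.Z", "arcb2a")]:
    print(f"{packed:9s} / {encoded:6s} = {ratio(packed, encoded):.3f}")
```

On these numbers compress saves about 2% on the uuencoded text and
enlarges the btoa text by about 12%, while lzhuf brings both to within
roughly 7% of the original archive size.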

Conclusions:
 1) lzhuf performs better than compress on uuencoded files
 2) lzhuf performs far better than compress on btoa'd files
 3) since most sites compress news before sending, uuencode is better
    than the more efficient btoa, in that it doesn't break compress as
    badly.
 4) if lzhuf becomes widely used for news compression, btoa becomes
    better for representing binaries, but only a little.

-- 
	bill davidsen		(wedu@crd.GE.COM)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

wcs@cbnewsh.ATT.COM (william.clare.stewart) (04/18/89)

In article <13603@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>   I have often noted the poor performance of compress on uuencoded
> files when sending news. I therefore took an arc file, converted it to
> text via uuencode, and applied compress and lzhuf to it. I repeated the
> test using the more efficient btoa routine.

	I hate to say it, but you're misunderstanding what's
	happening here, which means your numbers are probably all bogus.
	Uuencode hardly bothers compress at all - the problem is arc.
	Compress, and other LZW-based programs, compress things by
	taking advantage of redundancy in character sequences.
	Uuencode doesn't affect this redundancy much - it mostly
	lengthens the sequences, so compress takes a bit longer to get
	up to speed.  Arc, on the other hand, *compresses* the file -
	it has several compression algorithms available to it,
	including "don't", but I think it commonly uses LZW.
	So compressing an arc file doesn't gain much, because arc
	has already used up most of the redundancy that compress
	wants to use.  lzhuf apparently uses a different algorithm,
	which is why it still manages to do better.
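
	The double-compression effect is easy to see with a toy LZW
	coder. The Python sketch below is not compress itself: it
	assumes 9- to 16-bit codes and simply stops adding dictionary
	entries when the table is full, and the name `lzw_compress`
	and the sample text are made up for illustration.

```python
import binascii

def lzw_compress(data: bytes, max_bits: int = 16) -> bytes:
    """Toy LZW: variable-width codes from 9 bits up, no dictionary reset."""
    table = {bytes([i]): i for i in range(256)}
    next_code, width = 256, 9
    bitbuf = nbits = 0
    out = bytearray()
    current = b""

    def emit(code):
        # Pack one code, `width` bits wide, least significant bits first.
        nonlocal bitbuf, nbits
        bitbuf |= code << nbits
        nbits += width
        while nbits >= 8:
            out.append(bitbuf & 0xFF)
            bitbuf >>= 8
            nbits -= 8

    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the match
            continue
        emit(table[current])
        if next_code < (1 << max_bits):
            table[candidate] = next_code
            next_code += 1
            if next_code > (1 << width) and width < max_bits:
                width += 1               # codes no longer fit; widen
        current = bytes([byte])
    if current:
        emit(table[current])
    if nbits:
        out.append(bitbuf & 0xFF)        # flush the last partial byte
    return bytes(out)

def uuencode_body(data: bytes) -> bytes:
    """uuencode-style lines of up to 45 input bytes, without begin/end."""
    return b"".join(binascii.b2a_uu(data[i:i + 45])
                    for i in range(0, len(data), 45))

text = b"the quick brown fox jumps over the lazy dog\n" * 200
once = lzw_compress(text)                 # plays the role of arc
twice = lzw_compress(once)                # compress applied to the "arc"
via_uu = lzw_compress(uuencode_body(once))
print(len(text), len(once), len(twice), len(via_uu))
```

	The first pass removes most of the redundancy, so the second
	pass has very little left to work with, whether or not the
	data went through uuencode in between.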

	What you need to use as your input file is the stuff that
	went *into* the arc - after all, what you're trying to
	accomplish is using lzhuf or compress to replace arc.
	You also need to do statistical sampling on a variety of
	inputs - large & small (startup vs long-run effects),
	text/exe/source (each will have different amounts & kinds
	of redundancy), and input that has already been hacked
	through other processors, such as uuencode/btoa, compress,
	Huffman, and lzhuf itself (how do doubly-lzhuf'd files behave?)
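
	A test matrix along those lines might be sketched in Python as
	follows. Everything here is a stand-in: Python ships neither
	compress nor lzhuf, so zlib and lzma take their places, Ascii85
	substitutes for btoa, and the three generated samples are
	placeholders for real text, exe and source files.

```python
import base64
import binascii
import lzma
import random
import zlib

def uuencode_body(data: bytes) -> bytes:
    """uuencode-style lines of up to 45 input bytes, without begin/end."""
    return b"".join(binascii.b2a_uu(data[i:i + 45])
                    for i in range(0, len(data), 45))

encoders = {
    "raw": lambda d: d,
    "uuencode": uuencode_body,
    "ascii85": base64.a85encode,     # stand-in for btoa
}
compressors = {
    "zlib": zlib.compress,           # stand-ins only: not compress,
    "lzma": lzma.compress,           # and not lzhuf
}

rng = random.Random(0)               # fixed seed so runs are repeatable
words = [b"news ", b"arc ", b"btoa ", b"lzhuf ", b"\n"]
text = b"".join(rng.choice(words) for _ in range(4000))
samples = {
    "text": text,
    "random": bytes(rng.getrandbits(8) for _ in range(8000)),
    "precompressed": zlib.compress(text),
}

def run_matrix():
    """Return {(sample, encoder, compressor): (original, encoded, packed)}."""
    results = {}
    for sname, data in samples.items():
        for ename, encode in encoders.items():
            encoded = encode(data)
            for cname, compress in compressors.items():
                results[(sname, ename, cname)] = (
                    len(data), len(encoded), len(compress(encoded)))
    return results

for key, (orig, enc, packed) in sorted(run_matrix().items()):
    print(f"{key[0]:14s} {key[1]:9s} {key[2]:5s} {orig:6d} {enc:6d} {packed:6d}")
```

	Swapping in real files, and real compress and lzhuf binaries
	run as subprocesses, would give numbers that actually bear on
	the question.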

> Conclusions:
>  1) lzhuf performs better than compress on uuencoded files
You can't tell because your data was arced first.

>  2) lzhuf performs far better than compress on btoa'd files
potentially interesting

>  3) since most sites compress news before sending, uuencode is better
>     than the more efficient btoa, in that it doesn't break compress as
>     badly.
The reason this happens is that btoa scrambles the data a bit more
- it works on groups of 4 bytes where uuencode works on groups of 3 -
and this reduces the double-LZW effect.  To draw any conclusions, you
need a lot more data points though - try taking all the files in
comp.binaries.* and see what happens.
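
For reference, uuencode turns each 3 input bytes into 4 printable
characters (45 input bytes per line), and btoa turns each 4 bytes into
5. The Python sketch below works out the uuencode size for the archive
in the original post. It assumes a "begin 644 <name>" header line and a
zero-length line before "end"; the function name is made up for
illustration.

```python
import math

def uuencoded_size(n_bytes: int, name: str, mode: str = "644") -> int:
    """Expected size of classic uuencode output for n_bytes of input."""
    full_lines, tail = divmod(n_bytes, 45)
    size = len(f"begin {mode} {name}\n")
    size += full_lines * (1 + 60 + 1)      # length char, 60 chars, newline
    if tail:
        size += 1 + 4 * math.ceil(tail / 3) + 1
    size += 2 + len("end\n")               # zero-length line, then "end"
    return size

print(uuencoded_size(7028, "lzhsrc10.arc"))   # 9715
print(7028 * 4 / 3, 7028 * 5 / 4)             # bare 4/3 and 5/4 expansion
```

Under those assumptions the result agrees with the 9715 bytes posted
for arcuu, so the uuencode overhead is the 4/3 expansion plus two
characters per line and the header and trailer.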

>  4) if lzhuf becomes widely used for news compression, btoa becomes
>     better for representing binaries, but only a little.
At least for now, lzhuf implementations are too slow and
CPU-intensive to use for news compression, though it has more
potential for compressing the original code.

One thing about LZW - uncompressing is much faster than compressing.
How do the speeds compare for lzhuf?
-- 
# Bill Stewart, AT&T Bell Labs 2G218 Holmdel NJ 201-949-0705 ho95c.att.com!wcs
# "If it weren't for us, American troops would be invading exotic places like
# Lebanon and Grenada, and the Air Force would do stuff like bombing Libya"
#		Abbie Hoffman, R.I.P