[net.micro.mac] Backbone automatic news-compression question ...

werner@ut-ngp.UUCP (Werner Uhrig) (09/19/86)

I'd like some clarification, or to poll your opinions, on the following matter:

It has been mentioned several times (though I remember neither the context nor
the details) that several sites compress the news for transmission for
efficiency reasons, which makes perfectly good sense to me, and I'd expect
that more and more machines will adopt this scheme.

Now, it has also been said that it would therefore make no sense for users
to apply their own compression before posting, because that would defeat the
automatic news-compression and actually *INCREASE* the transmission time.

Now, in net.micro.mac we currently have a discussion about user-compression
of articles containing large text files, sources, hexed executables, whatever,
using a program called PackIt, which has three features:

1) make one file out of several (because they belong together for some reason)

2) compress a file or files (note that this makes sense even for a single file)

3) encrypt a file or files

The following arguments have been made:

1) doing encryption makes no sense for public postings (and I agree)

2) BinHexing stuff always makes sense, to detect errors introduced during
	transmission (I agree again)

3) making one file out of several files belonging together is desirable.
	(I agree with that, except when the resulting file would be too large
	for one posting and would therefore have to be broken up into separate
	articles; in that case I sometimes do not pack everything into one
	file, even if the individual files are large enough to require
	splitting into multiple articles anyway; but I don't feel strongly
	about this)

AND NOW THE ONE I HAVE PROBLEMS WITH:

4) using the compression feature (which saves about 20% on average) has been
	questioned, with the argument that it would defeat the backbone
	news-transmission compression.

THE QUESTION:

defeated? increased transmission length? what exactly happens? and is the
loss there minor enough to justify the gains all over the net?

CONSIDER:

If I pack and compress on my MAC before binhexing, the file which I end up
uploading is about 20% smaller.  Therefore, I save 20% upload time, and I save
20% transmission time for all sites that do not have news-compression installed.
Plus, I save 20% diskspace everywhere the article gets archived on disk,
temporarily or permanently (also on back-up tapes).  Also, those folks who
download to their Mac save 20% download-time as well as storage space on the
Mac-disk.

MY OPINION, based on a lack of details about the backbone compression algorithm:

I suspect that the backbone compression is defeated by packing at the user
end, in the sense that no further gain in transmission time is made.  However,
I doubt that transmission is lengthened by much, if at all.  Therefore, the only
thing lost is the overhead of attempting compression and achieving nothing in
the process.  So what ??!!

To have some basis for discussion, let me state an example:

I'm about to post a Macintosh executable program, with the following sizes:

26,829 bytes	executable program on the Mac
35,945 bytes	binhexed executable
22,455 bytes	compressed executable
30,760 bytes	binhexed compressed executable
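
For what it's worth, here is that trade-off spelled out as a quick Python
sketch (just the arithmetic on the four figures above, nothing more):

exe          = 26829    # executable program on the Mac
hexed        = 35945    # binhexed executable
packed       = 22455    # compressed executable
packed_hexed = 30760    # binhexed compressed executable

print("BinHex expansion, plain:      %.1f%%" % (100 * (hexed / exe - 1)))            # ~34%
print("BinHex expansion, compressed: %.1f%%" % (100 * (packed_hexed / packed - 1)))  # ~37%
print("PackIt saving before BinHex:  %.1f%%" % (100 * (1 - packed / exe)))           # ~16%
print("Net saving on the posting:    %.1f%%" % (100 * (1 - packed_hexed / hexed)))   # ~14%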

my alternatives are to post either 31K or 36K.  Given that there are trade-offs,
what is the BETTER way to go in your opinion?  There is one additional point
to consider: the version of the program that compresses is SHAREWARE, which was
a reason not to post compressed stuff, as it basically forced the receiving
end to uncompress and pay for the program (to stay honest) - but a public
domain program has since been written that does the uncompressing, though
not the compressing, so the receiving end is now free of the burden of
"pay or feel guilty" ...

	---Werner	@ngp.cc.utexas.edu	@ut-ngp.UUCP

PS: Follow-up articles are redirected to net.news and will not appear in
	net.micro.mac.

wcs@ho95e.UUCP (#Bill_Stewart) (09/21/86)

In article <4012@ut-ngp.UUCP> werner@ut-ngp.UUCP (Werner Uhrig) writes:
<discussion of whether to compress or not, since the backbone sites
<mostly compress news anyway, and double-compression tends to be
<counter-productive.

I've seen this done in two areas -
	Binary programs for PCs
	Very large text files (e.g. the Senate bill)

Binary programs have to be processed by a program such as BinHex or
uuencode to make them safe to transmit; my guess is that compressing
them *before* BinHexing would not have much effect on the results of
compress.  Two comments: for PC programs, the most popular compression
seems to be ARC, which is shareware.  Compress is PD, its behavior
is better known, and source is available from mod.sources; is there
any reason to prefer ARC?  Also, the "btoa" program that comes with
compress expands files by about 20% instead of the 33% for uuencode;
that could be significant if more people used it.
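
The two expansion figures fall out of the encodings' group sizes; a
back-of-the-envelope Python sketch (an illustration only - the ~20% quoted
above presumably also credits btoa's shortcut for runs of zero bytes, which
executables have plenty of):

def expansion(out_chars, in_bytes):
    return out_chars / in_bytes - 1

print("uuencode (3 bytes -> 4 chars): %.0f%%" % (100 * expansion(4, 3)))  # 33%, before per-line overhead
print("btoa     (4 bytes -> 5 chars): %.0f%%" % (100 * expansion(5, 4)))  # 25%, before per-line overhead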

For large text files, it's normally not appropriate, because of the
double-compression problem.  However, the Senate Bill was >64K
before compression/uuencoding, and significantly smaller afterwards.
This allowed it to survive braindamaged mailers, at the expense of
being more work to read, and the redundancy added by uuencode was
probably exploited by compress on the backbone.

On the subject of compression for transmission, if a backbone site is
transmitting a given message to 24 other sites, does it compress each
outgoing message once, or does it do it 24 times?  This is obviously
moot on a broadcast network (like Stargate, or ihnp4->rest of IH),
but could have a major CPU impact on an overloaded machine like
ucbvax, ihnp4, or allegra.  If I were designing "C News", I'd consider
storing all articles in compressed form, with special pre-defined
tokens for the common header items; this would have major disk space
and transmission savings, and would probably have low CPU impact for
news reading programs.
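
To make the "pre-defined tokens" idea concrete, here is one possible reading
of it as a Python sketch (an illustration only, not anything any news software
actually does): swap the header keywords that every article carries for
one-byte codes before compressing the rest.

HEADER_TOKENS = {
    b"Path: ": b"\x01", b"From: ": b"\x02", b"Newsgroups: ": b"\x03",
    b"Subject: ": b"\x04", b"Message-ID: ": b"\x05", b"Date: ": b"\x06",
}

def tokenize_headers(article):
    # Only touch the header block (everything before the first blank line),
    # so body text that merely looks like a header is left alone.
    head, sep, body = article.partition(b"\n\n")
    for keyword, token in HEADER_TOKENS.items():
        head = head.replace(keyword, token)
    return head + sep + body
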
-- 
# Bill Stewart, AT&T Bell Labs 2G-202, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs

fair@ucbarpa.Berkeley.EDU (Erik E. Fair) (09/21/86)

Included in the netnews distribution since last year has been a public
domain implementation of the Lempel-Ziv compression algorithm,
described in the June 1984 issue of IEEE Computer magazine, called
"compress.c".  Its maintenance has been taken over by a group of 
data compression hackers who use the "mail.compress" mailing list to
facilitate the exchange of ideas on data compression. The compress
program is currently in release 4.0, and to the best of my knowledge,
4.0 compress knows how to uncompress files compressed by all its
previous versions...
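
For the curious, the Lempel-Ziv-Welch idea at the heart of compress fits in a
few lines of Python; this is a from-scratch toy (not compress.c itself, which
adds variable-width output codes, a 12/16-bit code cap, and table resets):

def lzw_compress(data):
    table = {bytes([i]): i for i in range(256)}   # start with every single byte
    next_code = 256
    phrase = b""
    out = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in table:
            phrase = candidate                    # keep extending the current match
        else:
            out.append(table[phrase])             # emit code for the longest known phrase
            table[candidate] = next_code          # and remember the new, longer one
            next_code += 1
            phrase = bytes([byte])
    if phrase:
        out.append(table[phrase])
    return out

data = b"news batching news batching news batching " * 30
print(len(lzw_compress(data)), "output codes for", len(data), "input bytes")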

The basic method is that netnews will, for the purposes of transmitting
an article to a site, append the filename of the article (e.g.
/usr/spool/news/net/general/200) to a file that represents the site
you want to send the article to (e.g. /usr/spool/batch/decvax).
Periodically, a shell script is run that collects all the files (i.e.
articles) listed in the file-of-files into a "batch", that is, a single
file containing articles separated by standard markers.
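
In outline, the collection step looks something like this (a Python sketch; it
assumes the usual "#! rnews <size>" separator and re-uses the example paths
above):

def build_batch(file_of_files, batch_path):
    with open(file_of_files) as listing, open(batch_path, "wb") as batch:
        for line in listing:
            article_path = line.strip()          # e.g. /usr/spool/news/net/general/200
            if not article_path:
                continue
            with open(article_path, "rb") as article:
                text = article.read()
            batch.write(b"#! rnews %d\n" % len(text))   # marker: size of the article that follows
            batch.write(text)

build_batch("/usr/spool/batch/decvax", "/tmp/decvax.batch")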

This batch is then compressed using the "compress" program supplied in
the distribution, with whatever parameters the two sites involved
have agreed upon (certain sites, like PDP-11's and 80x86 systems, can't
use full-blast compress because of address-space limitations, and for
those sites the sending site has to give compress an argument
indicating that it should not go hog wild on memory use). Finally, the
script will hand off the batch to the transport mechanism (usually
UUCP) for queueing and subsequent transmission to the neighboring
site. The neighbor site then reverses the procedure (uncompress,
unpack the batch, process the articles) when the news gets there.
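
Continuing the sketch, the compress-and-hand-off step might look like this
(the "-b" bit limit and the cunbatch command on the far end reflect assumptions
about the usual setup, not a quote from anyone's actual script):

import subprocess

def ship_batch(batch_path, site, bits=16):
    compressed = batch_path + ".Z"
    with open(compressed, "wb") as out:
        subprocess.run(["compress", "-c", "-b%d" % bits, batch_path],
                       stdout=out, check=True)   # small bit limit for PDP-11/80x86 neighbors
    with open(compressed, "rb") as data:
        subprocess.run(["uux", "-", "%s!cunbatch" % site],
                       stdin=data, check=True)   # neighbor uncompresses and unbatches

ship_batch("/tmp/decvax.batch", "decvax", bits=12)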

On the sites that I run or have set up, not all links are compressed,
but all links are batched. When I compress, I try to build batches of
120K bytes (on the assumption that I will get 50% compression),
yielding a 60K file to send to my neighbor site. This is a magic number
in that it takes approximately 10 minutes to transmit a 60K file at
1200 baud using UUCP, and that is a good granularity if there is any
trouble with the link: UUCP overhead will only beat hard on me
every 10 minutes, but if I lose the link (Telco gets noisy, or either
computer crashes), I have at most 10 minutes of re-transmission to do.
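
Spelled out, the 10-minute figure checks out on the back of an envelope:

batch_bytes   = 60 * 1024   # ~60K compressed batch
bits_per_byte = 10          # 8 data bits plus start and stop bits
line_rate     = 1200        # bits per second

minutes = batch_bytes * bits_per_byte / line_rate / 60
print("%.1f minutes of raw line time" % minutes)   # ~8.5; UUCP packet/ack overhead pads it toward 10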

For those who are skeptical of the savings that compression gives, it
was proven by (I think) Jerry Aguirre of Olivetti <jerry@oliveb.UUCP>
that what compress takes in CPU cycles is more than made up for in the
savings of UUCP transmission CPU cycles (to say nothing of what it will
do for your phone bill). Drop a note to him for the exact figures.

As for Werner Uhrig's specific question of whether compressing and
BinHex'ing a Mac thing would defeat the data compression that we use at
transport: I doubt it, because as I understand it the process of
BinHex'ing expands the object into printable ASCII again, much like
UUENCODE, which is eminently compressible (UUCP transport works with
8-bit channels and handles binary data fine, it's just news that wants
to play with 7-bit text).  However,

1. I'm not one of the data compression hackers who really would know
	for certain.

2. When the difference is only 5K, my opinion is that it is not worth
	the trouble to compress at user-level.
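
For what it's worth, a quick experiment bears the "eminently compressible"
point out; in this sketch base64 and zlib stand in for BinHex/uuencode and the
transport compress (an illustration, not a measurement of real Mac binaries):

import base64, os, zlib

binary   = os.urandom(30000)            # stand-in for a worst-case, incompressible executable
encoded  = base64.b64encode(binary)     # printable ASCII, about a third bigger
squeezed = zlib.compress(encoded, 9)    # transport compression claws the expansion back

print(len(binary), len(encoded), len(squeezed))   # squeezed lands close to the raw size again

A real executable has structure of its own, so a transport-level compress
would typically do somewhat better than this worst case.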

	"The future's so bright, I gotta wear shades..."
			- Tim Buk III

	Erik E. Fair	ucbvax!fair	fair@ucbarpa.berkeley.edu

fair@ucbarpa.Berkeley.EDU (Erik E. Fair) (10/02/86)

In article <857@ho95e.UUCP> wcs@ho95e.UUCP (Bill Stewart 1-201-949-0705) writes:
> [deleted]
>
>On the subject of compression for transmission, if a backbone site is
>transmitting a given message to 24 other sites, does it compress each
>outgoing message once, or does it do it 24 times?  This is obviously
>moot on a broadcast network (like Stargate, or ihnp4->rest of IH),
>but could have a major CPU impact on an overloaded machine like
>ucbvax, ihnp4, or allegra.  If I were designing "C News", I'd consider
>storing all articles in compressed form, with special pre-defined
>tokens for the common header items; this would have major disk space
>and transmission savings, and would probably have low CPU impact for
>news reading programs.

If you send a single article to 24 sites, yes, it normally would get
compressed and sent 24 times. When I ran "dual", which had seven
"leaf" nodes feeding off of it at its peak, I set up a "pseudo-site"
named "leaf" that generated one batch file, and through the magic of
UNIX links, I sent a single batch to all seven sites. The key is that
the sites that you do this for *must* be

	1. real leaf nodes (that is, they don't get news from anywhere
		else, and generate very little themselves, because this
		scheme will send *everything* that they send to you back
		to them, in addition to sending it to all the other
		leaves).

	2. all using the same unbatching scheme (that is, they must
		all accept the same format of batch file as input; you
		can vary the command given to uux on a per-site
		basis).

In order to realize maximum savings from this, I had to hack uux to
take one more argument (not hard, the diff of the mod was distributed
with netnews for both v7 & sV UUCPs): the -l (minus el) argument; it is
equivalent to the "-c" option, except that it attempts to make a hard
UNIX link between the input file given to uux and the queue data file
before trying an explicit "copy". The effect was seven jobs queued with
UUCP, but only one copy of the data file, which was a big win for
/usr/spool/uucp disk space; as a result, the incremental cost to "dual"
of feeding a new "leaf" node was more modem time, and a little more
disk space for C. and X. files (i.e. not much).
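
The gist of the -l trick, sketched in Python (with made-up queue-file names
rather than the actual uux diff; the real C./D./X. naming conventions differ):

import os, shutil

def queue_batch(batch_path, spool_dir, sites):
    for site in sites:
        data_file = os.path.join(spool_dir, "D.%s.batch" % site)   # illustrative name only
        try:
            os.link(batch_path, data_file)      # one inode shared by every queued job
        except OSError:
            shutil.copy(batch_path, data_file)  # fall back to an explicit copy

queue_batch("/tmp/leaf.batch", "/usr/spool/uucp", ["siteA", "siteB", "siteC"])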

As for keeping the articles compressed on the disk, I don't think that
would win much, because articles are stored as single files (collecting
the articles in a newsgroup together would mean you couldn't link articles
across newsgroups), and the average article is pretty small (last I
checked, in 1984, admittedly a long time ago, it was 1600 bytes), so
the resultant savings (if any) would not be much.
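
One way to see why compressing small single files buys so little, while
batches compress so well: a short article gives the compressor almost no
context to work with.  A sketch, with zlib standing in for compress and
synthetic articles that share more text than real ones (so the gap is
somewhat overstated):

import zlib

def fake_article(i):
    return (b"Path: ucbvax!site%d!user\nNewsgroups: net.news\n"
            b"Subject: article %d\n\nA short body about news batching.\n" % (i, i))

articles = [fake_article(i) for i in range(100)]
one_by_one = sum(len(zlib.compress(a, 9)) for a in articles)
as_a_batch = len(zlib.compress(b"".join(articles), 9))
print(one_by_one, as_a_batch)   # the single batch comes out far smaller than 100 separate files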

	Erik E. Fair	ucbvax!fair	fair@ucbarpa.berkeley.edu