[net.news] Data compression to lower phone bills

roy@phri.UUCP (Roy Smith) (06/04/86)

In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
> [...] a site that passes news articles to other sites could just pass
> on the compressed form (and tack on a compressed form of articles
> originating from it).

	Funny, I was just talking with someone this afternoon about this.
There are a few problems.  The toughest one is that every article that
passes through here has to get "phri!" prepended to its path line.  No way
to do that without going through the uncompress/compress cycle (at least,
not that I can see).  Pity.
-- 
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

mcb@k.cs.cmu.edu (Michael Browne) (06/04/86)

In article <2369@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originating from it).
>
>	Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

OK, so don't compress the whole article, just the body and headers that you
don't have to look at or change.  The headers that you need can be put at
the beginning and the compressed part can appear after a blank line.
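
A minimal sketch of that layout, in modern Python for concreteness (an
anachronism here, of course; zlib stands in for compress, and treating
Path: as the only header a relay rewrites is an assumption):

    import zlib

    MUTABLE = (b"Path:",)   # headers a relay must be able to rewrite

    def pack(article):
        head, _, body = article.partition(b"\n\n")
        keep, squeeze = [], []
        for line in head.split(b"\n"):
            (keep if line.startswith(MUTABLE) else squeeze).append(line)
        blob = zlib.compress(b"\n".join(squeeze) + b"\n\n" + body)
        # plain-text mutable headers, one blank line, then the compressed rest
        return b"\n".join(keep) + b"\n\n" + blob

    def relay(packed, site):
        head, _, blob = packed.partition(b"\n\n")
        # prepend "site!" to the Path without touching the compressed part
        return head.replace(b"Path: ", b"Path: " + site + b"!", 1) + b"\n\n" + blob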

Gee, that was easy, which means that I'm undoubtedly missing some subtle (or
not so subtle) point.
	--Mike
-- 
UUCP: ..!seismo!k.cs.cmu.edu!mcb		ARPA: mcb@k.cs.cmu.edu

"It came time to move, so I packed up my Salvador Dali print of two 
blindfolded dental hygienists trying to make a circle on an Etch-a-Sketch..."

arnold@ucsfcgl.UUCP (Ken Arnold%CGL) (06/05/86)

In article <2369@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originating from it).
>
>	Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

Well, you could just compress the message body and leave the envelope
(the part with the address) uncompressed.  What are the other problems?

		Ken Arnold

aburt@isis.UUCP (06/05/86)

In article <2369@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originating from it).
>
>	Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

I see one, but it ain't pretty.  Modify the next version of news software
so that, rather than recompressing the news to send it, it instead sends
along the sitename to prepend on the other end when unbatching.  (Controlled
by a 'sys' file flag, of course, so that you don't send wrong paths to those
who can't handle it.)

E.g., site A creates a compressed batch of news and sends it to B.  B
sends along the compressed file to C, along with a note saying to prepend
B! to the path when C unbatches.  If C decides to send the compressed
batch on to D, it sends a note saying to prepend C!B!, etc.
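
The bookkeeping is just string accumulation.  A toy illustration in
Python (the note format and names are invented):

    def forward(note, mysite):
        # A sends note "" to B; B forwards "B!"; C forwards "C!B!"; ...
        return mysite + "!" + note

    note = forward(forward("", "B"), "C")
    print(note)   # "C!B!", prepended to each Path when D finally unbatches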

(-:
	And hey -- this might save a few cents on the phone bills too:
	
	In our neck of the woods, Path lines seem to average 86
	characters (minus the "Path: " and \n).  A random sampling
	showed about 60 articles per 50K batch file.  In the best case,
	the batch we get would be the same batch that left the first
	backbone site the article encountered (which probably would run
	csendbatch to repackage the little batches it got into 50K
	batches to send onward).  So figure that half of the path, or
	43 characters, is sent once rather than 60 times for that
	batch.  If compress reduces the batch to 40% of its size, we have
	(43*60*.4 - 43) or 989 bytes saved per 50K of compressed news,
	about 2%.

	Now, this isn't the problem the idea was intended to solve,
	but it might have small positive financial side effects as well.

	The idea definitely breaks down when site C doesn't want
	net.foobar from B but gets it transmitted anyway.  However,
	as an option where the conditions are right, it might work ok.

:-)
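
For what it's worth, the estimate above is easy to check; with the
post's assumed numbers, in Python:

    path  = 43        # path chars sent once instead of per-article
    arts  = 60        # articles per 50K batch
    ratio = 0.4       # compressed size / original size
    saved = path * arts * ratio - path
    print(saved, saved / 50_000)   # 989.0, about 2% of a 50K batch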
-- 

Andrew Burt
isis!aburt   or   aburt@udenver.csnet

jeff@rtech.UUCP (06/05/86)

> 
> In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originating from it).
> 
> There are a few problems.  The toughest one is that every article that
> passes through here has to get "phri!" prepended to its path line.  No way
> to do that without going through the uncompress/compress cycle (at least,
> not that I can see).  Pity.
> -- 
> Roy Smith, {allegra,philabs}!phri!roy

Why not compress the bodies of the articles, but not the headers?
-- 
Jeff Lichtman at rtech (Relational Technology, Inc.)
"Saints should always be judged guilty until they are proved innocent..."

{amdahl, sun}!rtech!jeff
{ucbvax, decvax}!mtxinu!rtech!jeff

dw@rocksvax.UUCP (Don Wegeng) (06/05/86)

In article <1015@k.cs.cmu.edu> mcb@k.cs.cmu.edu (Michael Browne) writes:
>OK, so don't compress the whole article, just the body and headers that you
>don't have to look at or change.  The headers that you need can be put at
>the beginning and the compressed part can appear after a blank line.

Conceptually speaking the header is not part of the message, it is part
of the envelope of the message. One could compress the message itself, but
leave the envelope intact. This is quite similar to the way email ought to
be handled.

Of course this ignores the fact that someone has to implement the software
to do this, and the net as a whole has to be persuaded to use the new
software. History shows that with Usenet neither is easy.

/Don
-- 
Everybody wants a happy ending.

arpa:	Wegeng.Henr@Xerox.COM
uucp:	ihnp4!rocksvax!dw

mark@cbosgd.UUCP (Mark Horton) (06/05/86)

In article <9875@ucsfcgl.ucsfcgl.UUCP> arnold@ucsfcgl.UUCP (Ken Arnold%CGL) writes:
>Well, you could just compress the message body and leave the envelope
>(the part with the address) uncompressed.  What are the other problems?

We're missing two things here.

(1) The average netnews message has as much header as it does body, so
    there would be great savings in compressing the header too.  (This
    isn't hard to fix - you compress all the header lines that don't
    change.)

(2) Compress is a learning program - it works well on very large files
    by noting patterns that occur over and over and compressing them.
    For short messages, it won't have learned much by the end of the
    message.  Perhaps this could be solved by somehow getting compress
    to save context between messages, or having a standard context it
    uses to start, with common sequences.  But you'd have to make sure
    the sender and receiver were in sync at all times.

This sounds hard but doable.  Anyone want to write it, preferably as
a subroutine so it can be embedded in news senders and readers?
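
A sketch of point (2) as it might look with zlib's preset-dictionary
hook standing in for the "standard context" (zlib is an anachronism
here, and the dictionary contents below are invented; sender and
receiver must hold byte-identical copies of it):

    import zlib

    # strings common to nearly every article header
    NEWS_DICT = (b"Path: Newsgroups: net.news Subject: Re: "
                 b"Message-ID: References: In article writes:")

    article = b"Subject: Re: compression\nNewsgroups: net.news\n\nshort body\n"

    plain = zlib.compress(article)
    c = zlib.compressobj(zdict=NEWS_DICT)
    primed = c.compress(article) + c.flush()
    print(len(plain), len(primed))   # the primed stream is typically smaller

    d = zlib.decompressobj(zdict=NEWS_DICT)   # receiver needs the same dict
    assert d.decompress(primed) == article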

	Mark

henry@utzoo.UUCP (Henry Spencer) (06/05/86)

> There are a few problems.  The toughest one is that every article that
> passes through here has to get "phri!" prepended to its path line.  No way
> to do that without going through the uncompress/compress cycle (at least,
> not that I can see).  Pity.

With serious revisions of the organization of transmission, it could be
done.  If batches are unbatched only for local availability, then the
Path line could be attached to the batch rather than the individual
articles.  Obvious problems with this are cases where one wants selective
relaying, either because some site is not getting all the groups or
because a site is part of a loop in the net and wants to pass on only the
things it hasn't already seen.  [We have both problems.]
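
A sketch of what a batch-level Path might look like (the "#! cpath"
directive is invented, not real batch syntax, and zlib stands in for
compress):

    import zlib

    articles = b"Newsgroups: net.news\n\nfirst body\n"   # unbatched only locally
    batch = b"#! cpath decvax\n" + zlib.compress(articles)

    def relay(batch, site):
        header, _, payload = batch.partition(b"\n")
        # only the batch header is rewritten; the payload passes untouched
        return (header.replace(b"#! cpath ", b"#! cpath " + site + b"!", 1)
                + b"\n" + payload)

    print(relay(batch, b"utzoo")[:22])   # b'#! cpath utzoo!decvax\n'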
-- 
Usenet(n): AT&T scheme to earn
revenue from otherwise-unused	Henry Spencer @ U of Toronto Zoology
late-night phone capacity.	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

mouse@mcgill-vision.UUCP (der Mouse) (06/06/86)

In article <2369@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form [...]
> [...] The toughest [problem] is [that of prepending the hostname to
> the Path: line].  No way to do that [still compressed].  Pity.

So compress just the invariant text, big deal.

[ First it was line eater lines.
  Now it's inews pacifier lines.
  What will it be tomorrow? ]
-- 
					der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     philabs!micomvax!musocs!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
        mcvax!seismo!cmcl2!philabs!micomvax!musocs!mcgill-vision!mouse
ARPAnet: utcsri!mcgill-vision!mouse@uw-beaver.arpa

"Come with me a few minutes, mortal, and we shall talk."

phil@portal.UUCP (Phil Sih) (06/06/86)

In article <2369@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> 	Funny, I was just talking with someone this afternoon about this.
> There are a few problems.  The toughest one is that every article that
> passes through here has to get "phri!" prepended to its path line.  No way
> to do that without going through the uncompress/compress cycle (at least,
> not that I can see).  Pity.

Wouldn't it be possible to compress only the invariant portion of the
data and not compress the headers?  Perhaps this would involve too much
change to the mail and news systems.  Considering some of the volume
that goes into the data part of net.sources and .maps (or .politics :-)
it might get nearly the same effect as compressing the whole thing.

       Phil Sih at Portal Communications Company, Cupertino, CA
	
       hoptoad!portal!phil
       atari!portal!phil
       sun!portal!phil

jer@peora.UUCP (J. Eric Roskos) (06/06/86)

> OK, so don't compress the whole article, just the body and headers that you
> don't have to look at or change.  The headers that you need can be put at
> the beginning and the compressed part can appear after a blank line.

Well, the only problem with that is that the compression scheme always
works on ordered pairs (p,n) where p is "something seen previously", and n
is "the next character seen".  The reason that the scheme works without
having to carry along with it a translation table is that p can denote an
arbitrarily long sequence of characters, but this sequence has to be built
up by experience as the program sees these pairs.  So, for example, the
first time you see the sequence "abc", the compressed output would be a
code for "a", a code for "b", and a code for "c", but it would also
remember the pair ("a","b") and save a code for it in an internal table.
So the next time it saw "abc", it would generate the code for ("a","b"),
thus saving some bits over the separate codes for "a" and "b", and then
would output the code for "c", and would remember (("a","b"),"c") in its
table.  Then the next time it saw "abc" it would just output the code for
(("a","b"),"c").  So each time, the output would be a shorter number of
bits than before for the same sequence.  (The start of the code's bits
in the output file is not byte aligned, incidentally, so you don't have
any "wasted" bits if you have an odd number of bits like 9 or 10 in your
code.)

So, as you can see, you only get "good" compression if the file is long
enough for the program to see repeated sequences enough times for it to
build up a code for the longer sequences.  Since the codes start out 9
bits long, as long as the codes are for single characters you don't get
any compression at all -- the file actually turns out bigger than it
started out as.

Since most postings (except postings like mine that ramble on and on over
some obscure topic :-)) are very short, you'd have to keep the modifiable
headers separate in an index for a whole set of articles and concatenate
the rest of the set into one long file to compress together; otherwise
the compression wouldn't work very well.
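
The learning behavior is easy to see in a toy LZW encoder (a bare
Python sketch; the real compress adds variable-width codes, a table
size limit, and a reset mechanism, all omitted here):

    def lzw_encode(data):
        table = {bytes([i]): i for i in range(256)}   # codes 0-255: single bytes
        codes, prefix = [], b""
        for byte in data:
            candidate = prefix + bytes([byte])
            if candidate in table:
                prefix = candidate             # keep extending the match
            else:
                codes.append(table[prefix])    # emit longest known sequence
                table[candidate] = len(table)  # learn the new sequence
                prefix = bytes([byte])
        if prefix:
            codes.append(table[prefix])
        return codes

    print(lzw_encode(b"abcabcabc"))   # [97, 98, 99, 256, 258, 257]
    # 6 codes at 9 bits each is 54 bits for 72 bits of input -- a modest
    # win even on this friendly case; 9 unrelated bytes would emit 9
    # codes (81 bits) and actually grow.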
-- 
E. Roskos
"Winds allow other skylines to hold you."

levy@ttrdc.UUCP (Daniel R. Levy) (06/08/86)

>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originating from it).
>
>	Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

There must be some sites out there which ARE passing on articles untouched,
however.  It becomes glaringly evident when I attempt to reply to some sites
using the given path and have the mail bounce because system A knows system
B which in turn knows system C, but system B passes news it gets from system
A verbatim to system C, and system C does not know how to contact system A.
Then when I try to mail ..!C!A!..!joeuser according to the return path, it
bounces.  (It is also possible that system A knows system C and sends news to
it but that system C does not reciprocate the relationship even for mail, to
be sure.  However whenever I've bothered to follow up a failed return path,
and sent mail to the administrator or postmaster of the system that bounced
the mail, I invariably seem to get told [by the postmaster of system C] that
"gee whiz, system B sends us our news, and I don't know why it does not
include its name in the return path.")

Seems like maybe it would help if the news software on systems like C could
check to see if the name of their feed is part of the return path and if not,
prepend that name along with its own.  Perhaps this could also reduce
loopbacks: if system B had a second feed, call it system D, this would let
D know that B had already handled an article which B had passed through
untouched once and which had propagated by some tortuous path from C back
to D.

(True, that wouldn't even help if the news had been passed along untouched
over more than one link.  Maybe a protocol for passing news through untouched
while still adding pertinent path information could be worked out?)
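
A sketch of that check in Python (all names invented; "feed" is the
site the batch actually arrived from):

    def fix_path(path, feed, mysite):
        hops = path.removeprefix("Path: ").split("!")
        if feed not in hops:       # the feed relayed without adding itself
            hops.insert(0, feed)
        return "Path: " + "!".join([mysite] + hops)

    print(fix_path("Path: A!joeuser", "B", "C"))   # Path: C!B!A!joeuser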
-- 
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------   Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
						vax135}!ttrdc!levy

rouil@ada-uts (06/09/86)

There are a few problems.  The toughest one is that every article that
passes through here has to get "phri!" prepended to its path line.  No way
to do that without going through the uncompress/compress cycle (at least,
not that I can see).  Pity.
-------------------------
  Unfortunately, I have already missed a good bit of the discussion so if
this is a repeat or just plain wrong, don't flame me.  Constructive
criticism is welcome however.

  In answer to the above problem, would it be possible to compress only
the bodies of the messages and leave the path line and any other routing
information uncompressed?  I would think that this would be similar
to encrypting a message, i.e., the message is encrypted (compressed) but the
routing information is not.  Some of the problems would be less space/time
saved when the routing info is the majority of the message.  Those annoying
one liners for example.  On messages with "normal" bodies and .sig lines
you would still save space/time on the bodies at least.

On regular messages, just how much space does the routing info take up?

John Rouillard                       ihnp4!inmet!ada-uts!rouil
Intermetrics Inc.
733 Concord Ave
Cambridge, Ma. 02138

Note:  I am not a sys person, just an interested user.  The company I work
       for would not take responsibility for my existence, let alone my opinions.

eric@chronon.UUCP (Eric Black) (06/11/86)

In article <2197@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>> [... various suggestions to compress article bodies, not headers ...]
> [... discussion of how adaptive code compression works ...]
>
>So, as you can see, you only get "good" compression if the file is long
>enough for the program to see repeated sequences enough times for it to
>build up a code for the longer sequences.  Since the codes start out 9
>bits long, as long as the codes are for single characters you don't get
>any compression at all -- the file actually turns out bigger than it
>started out as.

It appears that the majority of news traffic consists of three basic
parts:   1) header information,  2) quoted excerpts (?) of other
articles, and 3) new text.

While the particular header on a particular article may be quite different
on a char-for-char basis from any other, a large number of article headers
taken as a whole would seem to be an excellent candidate for this method
of compression.

There is a tendency for articles to come out in clusters in response
to any given prior article, and the quantity of included quotes is not
only higher than it needs to be for most individual articles, but often
the same referential text is included in a large number of followup articles.
This also seems to provide good potential for this compression technique,
assuming that articles containing the same text are compressed together.

New text is usually some form of English, however technically-oriented
it may be, and provides some sort of character distribution as might
be expected from English.  This will benefit from compression even if
the exact text appears in only one article.

One solution might be, then, to separate the bodies of news articles
from the headers, and batch and compress them separately.  This brings
up all sorts of reliability and queueing issues (since the text and the
control information are sent separately), but would allow pass-through
of news articles while requiring decompressing, modifying, and recompressing
the headers only.  The relay site can peruse the batched headers to
determine if it is worth decompressing the article bodies for local
consumption.  Careful partitioning of the newsgroups into batches
could then reduce the cost ($$ and cycles/disks) to relay sites, so
that they may be less reluctant to continue downstream feeding of newsgroups
they don't consume themselves.

This partitioning is not an easy task; it is not clear that there is anything
that could be agreed upon as an optimal solution.  Not having to
recompress passed-through articles (even if they get decompressed
locally) should save some cycles, anyway.
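
A sketch of the two-stream split (the framing is invented: a simple
length index so the far end can re-pair headers with bodies, with zlib
again standing in for compress):

    import zlib

    def make_batch(articles):
        heads, bodies, index = [], [], []
        for art in articles:
            head, _, body = art.partition(b"\n\n")
            index.append((len(head), len(body)))   # lets the far end re-pair
            heads.append(head)
            bodies.append(body)
        # a relay decompresses and rewrites only the small header stream;
        # the body stream is forwarded still compressed
        return (index,
                zlib.compress(b"".join(heads)),
                zlib.compress(b"".join(bodies)))

    idx, hdrs, bods = make_batch([b"Path: phri\nSubject: test\n\nhello\n"])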
-- 
Eric Black   "Garbage In, Gospel Out"
UUCP:        {sun,pyramid,hplabs,amdcad}!chronon!eric