[comp.org.usenix] USENIX Board Studies UUCP: Compression

limonce@pilot.njin.net (Tom Limoncelli) (12/01/89)

I see two major ways to compress the news batches even more.

1 -- First, don't send messages that have already arrived at the site.
I know the algorithm used now is great, but it is not near-optimal.
How about an "I have xxxx" / "Send me yyy" negotiation?  The two
machines could hang up, do the batching/compressing/fooing/baring, then
dial up again and transmit the batches.
(Maybe this implementation isn't optimal, but I think some kind of
negotiation needs to be worked out.)
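A toy sketch of the kind of negotiation I mean (the message IDs and the
set-difference framing are mine, not anything in the current software):

```python
# Toy model of an "I have xxxx" / "Send me yyy" negotiation.
# The sender offers a list of IDs; the receiver replies with the
# subset it hasn't seen; only those get batched and shipped.

def negotiate(sender_has, receiver_has):
    """Return the IDs the sender should batch: those the receiver lacks."""
    offered = set(sender_has)             # the "ihave" list
    wanted = offered - set(receiver_has)  # the "sendme" reply
    return sorted(wanted)

# Example: the receiver already got two of four articles elsewhere.
print(negotiate(
    ["<a@x>", "<b@y>", "<c@z>", "<d@w>"],
    ["<b@y>", "<d@w>"],
))  # only the two unseen articles get batched
```

The real cost, of course, is the extra round trip, not the set arithmetic.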

2 -- A new compression scheme.  We all know that the current one is
quite good, but I have one addition.  I can compress the following
four lines (not including blanks):

In article <93061@pyramid.pyramid.com> romain@pyramid.pyramid.com (Romain Kang) writes:
> Clearly, the Telebit 'g' spoof works well for us now.  Philosophically,
> though, it bothers me that we have an OS as vendor-independent as UNIX,
> yet we are so dependent on Telebit.  Ideally, other equipment should be

as a string like "<93061@pyramid.pyramid.com>1:3".  This would mean
something like "message '<93061@pyramid.pyramid.com>' lines 1 thru 3".
This is 30 bytes instead of about 300 bytes.  This requires that (1) the
other site already has that message, and (2) the user did only trivial
editing of the old post.

(1) can be solved by intelligent software.  If you're sending the
other message in the batch, you know the other site will have that message.
You could also get smarter with the ihave/sendme negotiation.

(2) can be solved by letting users know that it's "nice" to do "clean"
edits.
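Here is roughly what I have in mind, as a sketch (the encoding format and
the article store are hypothetical, and it assumes caveats (1) and (2)
above hold):

```python
# Sketch of the proposed quote compression: replace a run of lines
# quoted verbatim from an earlier article with "<message-id>first:last".
# Assumes the receiving site still holds the referenced article and
# that the quoter did not edit the quoted lines.

def encode_quote(msg_id, first, last):
    return "%s%d:%d" % (msg_id, first, last)

def decode_quote(ref, article_store):
    """Expand '<id>m:n' using the locally stored article's lines."""
    msg_id, _, span = ref.partition(">")
    msg_id += ">"
    first, last = (int(n) for n in span.split(":"))
    lines = article_store[msg_id]
    return ["> " + ln for ln in lines[first - 1:last]]

store = {"<93061@pyramid.pyramid.com>": [
    "Clearly, the Telebit 'g' spoof works well for us now.  Philosophically,",
    "though, it bothers me that we have an OS as vendor-independent as UNIX,",
    "yet we are so dependent on Telebit.  Ideally, other equipment should be",
]}
ref = encode_quote("<93061@pyramid.pyramid.com>", 1, 3)
print(ref)                                  # the 30-byte wire form
print("\n".join(decode_quote(ref, store)))  # the ~300 bytes reconstructed
```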

Some interesting notes:  The header ("In article...") can be
completely generated by the destination site.
Usenet software would reach a hypertext state.  Software could store
the messages in a similar format, saving disk space.  The user could
(point|cursor|select) a quote and the software could pop up the entire
article.

Some interesting caveats:  The example achieved 10:1 compression, though
other cases will do much worse (and others much better).
Even then, not too much of an article is "quoted" material.  Or is it?
I wonder if anyone keeps statistics on such a thing.  Yes, users are
supposed to "summarize when quoting", but they don't; and this would
make the point moot.  Is this good or bad?  Go figure.
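Anyone curious could gather that statistic easily enough.  A throwaway
sketch (it counts only the conventional '>' prefix, which is an
assumption about quoting style):

```python
# Rough measure of how much of an article body is quoted material:
# the fraction of non-blank lines that begin with '>'.

def quoted_fraction(body):
    lines = [ln for ln in body.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    quoted = sum(1 for ln in lines if ln.lstrip().startswith(">"))
    return quoted / len(lines)

# Example: half the non-blank lines are quoted.
print(quoted_fraction("> old line\nnew line\n> old line\nnew line"))
```

Run over a spool directory, that would settle the question one way or
the other.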

What do you think?

-Tom
-- 
Tom Limoncelli -- limonce@pilot.njin.net -- tlimonce@drunivac.bitnet
rutgers!njin!tlimonce -- Drew University, Madison, NJ -- 201-408-5389
"All's well that ends well... if you're a functional rationalist."

henry@utzoo.uucp (Henry Spencer) (12/02/89)

In article <Dec.1.00.15.11.1989.10246@pilot.njin.net> limonce@pilot.njin.net (Tom Limoncelli) writes:
>1 -- First, don't send messages that have already arrived at the site.
>I know the algorithm used now is great, but it is not near-optimal.
>How about a "I have xxxx" / "Send me yyy" negotiation?  ...

This already exists:  the ihave/sendme protocol.  It's not great.  Apart
from reliability issues, the main problem is that it slows things down
quite a bit.

Most uucp-fed sites don't have multiple feeds, and the answer is always
"no, I haven't seen that yet, please send it".  So the delay and the extra
traffic are pointless.  Ihave/sendme protocols really work *well* only
over real-time connections like the Internet.  (This is what NNTP is
all about.)

>2 -- A new compression scheme.  We all know that the current one is
>quite good, but I have one addition...
>...a string like "<93061@pyramid.pyramid.com>1:3".  This would mean
>something like "message '<93061@pyramid.pyramid.com' lines 1 thru 3".
>This is 30 bytes instead of about 300 bytes.  This requires (1) the
>other site has that message already. (2) the user did only trivial
>editing of the old post.

Unfortunately, (1) simply cannot be taken for granted.  They may not
have received it.  They may have dropped it on the floor.  They may
have already expired it (not an insignificant possibility as volume
continues to rise, and you generally don't know what your neighbors'
expiry policies are).  And sensible posters edit included text quite
a bit, usually.  The bulk of the traffic on the net is *not* included
text.

>...Usenet software would reach a hypertext state...

There are people who are interested in doing this.  However, it's *not*
trivial to do.  It's a research area, not something we can immediately
tap for useful results.
-- 
Mars can wait:  we've barely   |     Henry Spencer at U of Toronto Zoology
started exploring the Moon.    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

brad@looking.on.ca (Brad Templeton) (12/02/89)

The problem is that detecting forwarded lines, and using them when the
original article is in the same batch, is exactly what Lempel-Ziv
does for you now!

Why do you think LZ does so well on news batches?  I know that I and
many other people have thought a fair bit about better compression schemes
for news.  Doing better is possible, but it's a lot harder than it looks.
News has a lot
of repeated strings, and batches have even more, so LZ does a very impressive
job to start with.
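You can see this for yourself; here zlib stands in for the LZ-based
compress(1) of the day (different algorithm, same effect on repeats):

```python
# When an article and a follow-up quoting it travel in the same batch,
# the compressor already exploits the repetition: the quoted copy costs
# far less than its raw size.
import zlib

article = (
    "Clearly, the Telebit 'g' spoof works well for us now.  Philosophically,\n"
    "though, it bothers me that we have an OS as vendor-independent as UNIX,\n"
    "yet we are so dependent on Telebit.  Ideally, other equipment should be\n"
)
followup = "".join("> " + ln + "\n" for ln in article.splitlines())

alone = len(zlib.compress(article.encode()))
together = len(zlib.compress((article + followup).encode()))
print(alone, together)  # the quoted copy adds far fewer bytes than it contains
```

So a quote-reference scheme mostly buys you what the compressor is
already getting for free, whenever the original is in the same batch.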
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

steve@nuchat.UUCP (Steve Nuchia) (12/03/89)

In article <1989Dec1.220852.7325@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>This already exists:  the ihave/sendme protocol.  It's not great.  Apart
>from reliability issues, the main problem is that it slows things down
>quite a bit.

Ihave/sendme could be greatly improved if the whole history line,
rather than just the article ID, were sent.

This would add the article's arrival time at the sending site, its
newsgroup list, and its file name (or enough information to calculate
one) to the ihave message.

Benefits:  The receiving site can implement an arbitrary newsgroup
acceptance policy without cooperation from the remote -- no more
remote sys file maintenance headaches.

A minor benefit (for senders with a good dbm) is that the
history lookup can be bypassed when servicing the sendme.

A major benefit is that receivers can implement newsgroup-specific
cost optimization policies.  For instance, someone with a uunet
line and a couple of local lines could decide to order anything
in comp.all from uunet within two hours after it gets there but
wait 12 hours for other traffic.
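A sketch of that last policy (the tab-separated field layout and the
delay table are hypothetical, not what any news software does today):

```python
# Given a full history line in the offer -- message-id, arrival time,
# newsgroup list -- the receiver can decide per-group how soon to send
# the sendme, without the remote's sys file knowing anything about it.

DELAY_HOURS = {"comp": 2, "default": 12}  # order comp.all quickly, rest later

def sendme_after(history_line):
    """Hours to wait before requesting the offered article."""
    msg_id, arrival, groups = history_line.split("\t")
    for group in groups.split(","):
        if group.split(".")[0] == "comp":
            return DELAY_HOURS["comp"]
    return DELAY_HOURS["default"]

offer = "<93061@pyramid.pyramid.com>\t628560000\tcomp.org.usenix,news.misc"
print(sendme_after(offer))  # comp.* article: order within 2 hours
```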
-- 
Steve Nuchia	      South Coast Computing Services      (713) 964-2462
"Man is still the best computer that we can put aboard a spacecraft --
 and the only one that can be mass produced with unskilled labor."
					- Wernher von Braun