roy@phri.UUCP (Roy Smith) (06/04/86)
In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
> [...] a site that passes news articles to other sites could just pass
> on the compressed form (and tack on a compressed form of articles
> originate from it).

Funny, I was just talking with someone this afternoon about this.
There are a few problems.  The toughest one is that every article that
passes through here has to get "phri!" prepended to its path line.  No way
to do that without going through the uncompress/compress cycle (at least,
not that I can see).  Pity.
--
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
mcb@k.cs.cmu.edu (Michael Browne) (06/04/86)
In article <2369@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originate from it).
>
> Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

OK, so don't compress the whole article, just the body and headers that you
don't have to look at or change.  The headers that you need can be put at
the beginning and the compressed part can appear after a blank line.

Gee, that was easy, which means that I'm undoubtedly missing some subtle
(or not so subtle) point.
					--Mike
--
UUCP: ..!seismo!k.cs.cmu.edu!mcb
ARPA: mcb@k.cs.cmu.edu

"It came time to move, so I packed up my Salvador Dali print of two
blindfolded dental hygienists trying to make a circle on an Etch-a-Sketch..."
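A minimal sketch, in C, of the layout Browne suggests.  The article format
(plain-text headers, a blank line, then a compressed body) and the relay
name "oursite" are assumptions for illustration, not part of any existing
news software.  The relay edits only the Path line and copies the
compressed bytes through untouched, so it never runs uncompress/compress.

    /*
     * Relay pass-through for a hypothetical "plain headers, compressed body"
     * article format.  Header lines are copied as text, with the local site
     * prepended to the Path: line; everything after the first blank line
     * (the compressed body) is copied byte-for-byte.
     */
    #include <stdio.h>
    #include <string.h>

    #define SITE "oursite"          /* hypothetical local site name */

    int main(void)
    {
        char line[1024];
        int  c;

        /* Copy the plain-text header, editing only the Path: line. */
        while (fgets(line, sizeof line, stdin) != NULL) {
            if (line[0] == '\n') {          /* blank line ends the header */
                putchar('\n');
                break;
            }
            if (strncmp(line, "Path: ", 6) == 0)
                printf("Path: %s!%s", SITE, line + 6);
            else
                fputs(line, stdout);
        }

        /* Pass the compressed body through verbatim. */
        while ((c = getchar()) != EOF)
            putchar(c);

        return 0;
    }

Fed one such article on stdin, it writes the relayed copy on stdout; the
compressed part is never expanded along the way.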
arnold@ucsfcgl.UUCP (Ken Arnold%CGL) (06/05/86)
In article <2369@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originate from it).
>
> Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

Well, you could just compress the message body and leave the envelope
(the part with the address) uncompressed.  What are the other problems?

		Ken Arnold
aburt@isis.UUCP (06/05/86)
In article <2369@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originate from it).
>
> Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

I see one, but it ain't pretty.

Modify the next version of news software so that, rather than recompressing
the news to send it, it instead sends along the sitename to prepend on the
other end when unbatching.  (Controlled by a 'sys' file flag, of course, so
that you don't send wrong paths to those who can't handle it.)

E.g., site A creates a compressed batch of news and sends it to B.  B sends
along the compressed file to C, along with a note saying to prepend B! to
the path when C unbatches.  If C decides to send the compressed batch on to
D, it sends a note saying to prepend C!B!, etc.

(-: And hey -- this might save a few cents on the phone bills too: In our
neck of the woods, Path lines seem to average 86 characters (minus the
"Path: " and \n).  A random sampling showed about 60 articles per 50K batch
file.  In the best case, the batch we get would be the same batch that left
the first backbone site the article encountered (which probably would run
csendbatch to repackage the little batches it got into 50K batches to send
onward).  So figure that half of the path, or 43 characters, is sent once
rather than 60 times for that batch.  If compress reduces the batch to 40%
of its size, we have (43*60*.4 - 43) or 989 bytes saved per 50K batch,
about 2%.  Now, this isn't the problem the idea was intended to solve, but
it might have small positive financial side effects as well.

The idea definitely breaks down when site C doesn't want net.foobar from B
but gets it transmitted anyway.  However, as an option where the conditions
are right, it might work ok.  :-)
--
Andrew Burt

isis!aburt   or   aburt@udenver.csnet
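A sketch of the unbatching end of this scheme, in C.  The helper and the
way the "prepend" note travels are hypothetical; the point is only that the
accumulated relay names get spliced into each Path line once, when the
batch is finally expanded, rather than at every hop.

    /*
     * Hypothetical helper for the scheme above: the relay chain ("C!B!")
     * travels beside the untouched compressed batch and is spliced into
     * each article's Path line only at unbatching time.
     */
    #include <stdio.h>
    #include <string.h>

    /* Write a corrected Path header: "Path: " + accumulated relays + old route. */
    static void fix_path(const char *pathline, const char *prepend, FILE *out)
    {
        if (strncmp(pathline, "Path: ", 6) == 0)
            fprintf(out, "Path: %s%s", prepend, pathline + 6);
        else
            fputs(pathline, out);           /* not a Path line; copy as-is */
    }

    int main(void)
    {
        /* Example: the batch left A, then was relayed by B and C untouched. */
        fix_path("Path: A!gore\n", "C!B!", stdout);   /* -> Path: C!B!A!gore */
        return 0;
    }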
jeff@rtech.UUCP (06/05/86)
> In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originate from it).
>
> There are a few problems.  The toughest one is that every article that
> passes through here has to get "phri!" prepended to its path line.  No way
> to do that without going through the uncompress/compress cycle (at least,
> not that I can see).  Pity.
> --
> Roy Smith, {allegra,philabs}!phri!roy

Why not compress the bodies of the articles, but not the headers?
--
Jeff Lichtman at rtech (Relational Technology, Inc.)
"Saints should always be judged guilty until they are proved innocent..."

{amdahl, sun}!rtech!jeff
{ucbvax, decvax}!mtxinu!rtech!jeff
dw@rocksvax.UUCP (Don Wegeng) (06/05/86)
In article <1015@k.cs.cmu.edu> mcb@k.cs.cmu.edu (Michael Browne) writes:
>OK, so don't compress the whole article, just the body and headers that you
>don't have to look at or change.  The headers that you need can be put at
>the beginning and the compressed part can appear after a blank line.

Conceptually speaking, the header is not part of the message; it is part of
the envelope of the message.  One could compress the message itself, but
leave the envelope intact.  This is quite similar to the way email ought to
be handled.

Of course this ignores the fact that someone has to implement the software
to do this, and the net as a whole has to be persuaded to use the new
software.  History shows that with Usenet neither is easy.

/Don
--
Everybody wants a happy ending.

arpa: Wegeng.Henr@Xerox.COM
uucp: ihnp4!rocksvax!dw
mark@cbosgd.UUCP (Mark Horton) (06/05/86)
In article <9875@ucsfcgl.ucsfcgl.UUCP> arnold@ucsfcgl.UUCP (Ken Arnold%CGL) writes:
>Well, you could just compress the message body and leave the envelope
>(the part with the address) uncompressed.  What are the other problems?

We're missing two things here.

(1) The average netnews message has as much header as it does body, so
there would be great savings in compressing the header too.  (This isn't
hard to fix - you compress all the header lines that don't change.)

(2) Compress is a learning program - it works well on very large files by
noting patterns that occur over and over and compressing them.  For short
messages, it won't have learned much by the end of the message.

Perhaps this could be solved by somehow getting compress to save context
between messages, or having a standard context it uses to start, with
common sequences.  But you'd have to make sure the sender and receiver were
in sync at all times.  This sounds hard but do-able.  Anyone want to write
it, preferably as a subroutine so it can be embedded in news senders and
readers?

	Mark
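The "standard starting context" Mark describes is what later compression
libraries call a preset dictionary.  Below is a minimal sketch, assuming
zlib's deflate (a much later library than the compress of the day, chosen
only to illustrate the principle): both ends agree on a dictionary of text
common to news articles, so even a short message compresses well from its
first byte.  The dictionary contents here are invented examples.

    /*
     * Sketch of the shared-starting-context idea using a preset dictionary.
     * Sender and receiver must use byte-identical dictionaries, which is
     * exactly the synchronization problem noted above.
     */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Hypothetical shared context: strings that recur in nearly every article. */
    static const char dict[] =
        "Path: Newsgroups: net.news Subject: Re: Message-ID: <@.UUCP> "
        "Date: References: Organization: Lines: In article writes: > ";

    int main(void)
    {
        char msg[] =
            "Newsgroups: net.news\nSubject: Re: compressing news\n"
            "In article <123@site.UUCP> someone writes:\n> short message\n";
        unsigned char out[1024];
        z_stream s;

        memset(&s, 0, sizeof s);
        deflateInit(&s, Z_BEST_COMPRESSION);
        deflateSetDictionary(&s, (const Bytef *)dict, (uInt)(sizeof dict - 1));

        s.next_in   = (Bytef *)msg;
        s.avail_in  = (uInt)(sizeof msg - 1);
        s.next_out  = out;
        s.avail_out = sizeof out;
        deflate(&s, Z_FINISH);              /* compress the whole message */

        printf("%u bytes in, %lu bytes out (with shared dictionary)\n",
               (unsigned)(sizeof msg - 1), (unsigned long)s.total_out);
        deflateEnd(&s);
        return 0;
    }

The receiving side would call inflateSetDictionary with the identical bytes
when inflate reports Z_NEED_DICT; any mismatch between the two dictionaries
makes the stream undecodable, which is the sender/receiver sync issue.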
henry@utzoo.UUCP (Henry Spencer) (06/05/86)
> There are a few problems.  The toughest one is that every article that
> passes through here has to get "phri!" prepended to its path line.  No way
> to do that without going through the uncompress/compress cycle (at least,
> not that I can see).  Pity.

With serious revisions of the organization of transmission, it could be
done.  If batches are unbatched only for local availability, then the Path
line could be attached to the batch rather than the individual articles.
Obvious problems with this are cases where one wants selective relaying,
either because some site is not getting all the groups or because a site is
part of a loop in the net and wants to pass on only the things it hasn't
already seen.  [We have both problems.]
--
Usenet(n): AT&T scheme to earn revenue
from otherwise-unused late-night	Henry Spencer @ U of Toronto Zoology
phone capacity.				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
mouse@mcgill-vision.UUCP (der Mouse) (06/06/86)
In article <2369@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> In article <8200002@nucsrl> gore@nucsrl.UUCP writes:
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form [...]
> [...] The toughest [problem] is [that of appending the hostname to
> the Path: line].  No way to do that [still compressed].  Pity.

So compress just the invariant text, big deal.

[ First it was line eater lines.  Now it's inews pacifier lines.  What
  will it be tomorrow? ]
--
					der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     philabs!micomvax!musocs!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
        mcvax!seismo!cmcl2!philabs!micomvax!musocs!mcgill-vision!mouse
ARPAnet: utcsri!mcgill-vision!mouse@uw-beaver.arpa

"Come with me a few minutes, mortal, and we shall talk."
phil@portal.UUcp (Phil Sih) (06/06/86)
In article <2369@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> Funny, I was just talking with someone this afternoon about this.
> There are a few problems.  The toughest one is that every article that
> passes through here has to get "phri!" prepended to its path line.  No way
> to do that without going through the uncompress/compress cycle (at least,
> not that I can see).  Pity.

Wouldn't it be possible to compress only the invariant portion of the data
and not compress the headers?  Perhaps this would involve too much change
to the mail and news systems.  Considering some of the volume that goes
into the data part of net.sources and .maps (or .politics :-) it might get
nearly the same effect as compressing the whole thing.

Phil Sih at Portal Communications Company, Cupertino, CA
hoptoad!portal!phil   atari!portal!phil   sun!portal!phil
jer@peora.UUCP (J. Eric Roskos) (06/06/86)
> OK, so don't compress the whole article, just the body and headers that you
> don't have to look at or change.  The headers that you need can be put at
> the beginning and the compressed part can appear after a blank line.

Well, the only problem with that is that the compression scheme always
works on ordered pairs (p,n) where p is "something seen previously", and n
is "the next character seen".  The reason that the scheme works without
having to carry along with it a translation table is that p can denote an
arbitrarily long sequence of characters, but this sequence has to be built
up by experience as the program sees these pairs.

So, for example, the first time you see the sequence "abc", the compressed
output would be a code for "a", a code for "b", and a code for "c", but it
would also remember the pair ("a","b") and save a code for it in an
internal table.  So the next time it saw "abc", it would generate the code
for ("a","b"), thus saving some bits over the separate codes for "a" and
"b", and then would output the code for "c", and would remember
(("a","b"),"c") in its table.  Then the next time it saw "abc" it would
just output the code for (("a","b"),"c").  So each time, the output would
be a shorter number of bits than before for the same sequence.  (The start
of the code's bits in the output file is not byte aligned, incidentally, so
you don't have any "wasted" bits if you have an odd number of bits like 9
or 10 in your code.)

So, as you can see, you only get "good" compression if the file is long
enough for the program to see repeated sequences enough times for it to
build up a code for the longer sequences.  Since the codes start out 9 bits
long, as long as the codes are for single characters you don't get any
compression at all -- the file actually turns out bigger than it started
out as.  Since most postings (except postings like mine that ramble on and
on over some obscure topic :-)) are very short, you'd either have to build
an index of a set of articles with the modified headers separate and
concatenate the rest of the set of articles into one long file and compress
that, or else the compression wouldn't work very well.
--
E. Roskos
"Winds allow other skylines to hold you."
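To make the table-building described above concrete, here is a toy encoder
in C.  It is a sketch only: it prints decimal code numbers instead of
packing variable-width (9-bit-and-up) codes, searches its table linearly,
and never clears the table, so it shows the behaviour rather than the
performance of the real compress(1).

    /*
     * Toy encoder for the (prefix, next-character) scheme described above.
     * Codes 0..255 stand for single characters; each time the pair (w, c)
     * is not yet in the table, the code for w is emitted and the pair is
     * learned as a new, longer code.
     */
    #include <stdio.h>

    #define MAXCODES 4096

    static int pref[MAXCODES];      /* code for "something seen previously" */
    static int suff[MAXCODES];      /* the next character appended to it */

    int main(void)
    {
        int ncodes = 256;           /* next free code number */
        int w = -1;                 /* current prefix code; -1 means "empty" */
        int c;

        while ((c = getchar()) != EOF) {
            int i, match = -1;

            /* Is the pair (w, c) already in the table? */
            for (i = 256; w != -1 && i < ncodes; i++)
                if (pref[i] == w && suff[i] == c) { match = i; break; }

            if (match != -1) {
                w = match;          /* yes: keep extending the known sequence */
            } else {
                if (w != -1) {
                    printf("%d ", w);        /* emit longest known prefix */
                    if (ncodes < MAXCODES) { /* learn the new pair (w, c) */
                        pref[ncodes] = w;
                        suff[ncodes] = c;
                        ncodes++;
                    }
                }
                w = c;              /* start matching again at this character */
            }
        }
        if (w != -1)
            printf("%d\n", w);      /* flush the final prefix */
        return 0;
    }

Run on a short article it emits almost nothing but single-character codes
(9 bits each in the real program), which is exactly why short postings gain
so little.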
levy@ttrdc.UUCP (Daniel R. Levy) (06/08/86)
>> [...] a site that passes news articles to other sites could just pass
>> on the compressed form (and tack on a compressed form of articles
>> originate from it).
>
> Funny, I was just talking with someone this afternoon about this.
>There are a few problems.  The toughest one is that every article that
>passes through here has to get "phri!" prepended to its path line.  No way
>to do that without going through the uncompress/compress cycle (at least,
>not that I can see).  Pity.

There must be some sites out there which ARE passing on articles untouched,
however.  It becomes glaringly evident when I attempt to reply to some
sites using the given path and have the mail bounce because system A knows
system B which in turn knows system C, but system B passes news it gets
from system A verbatim to system C, and system C does not know how to
contact system A.  Then when I try to mail ..!C!A!..!joeuser according to
the return path, it bounces.

(It is also possible that system A knows system C and sends news to it but
that system C does not reciprocate the relationship even for mail, to be
sure.  However, whenever I've bothered to follow up a failed return path
and sent mail to the administrator or postmaster of the system that bounced
the mail, I invariably seem to get told [by the postmaster of system C]
that "gee whiz, system B sends us our news, and I don't know why it does
not include its name in the return path.")

Seems like maybe it would help if the news software on systems like C could
check to see if the name of their feed is part of the return path and, if
not, prepend that name along with its own.  Perhaps this could also reduce
loopbacks, if system B had more than one feed (call the other one system
D), by allowing system D to know that system B had already handled an
article which B had passed through untouched once and which propagated by a
tortuous path from C back to D.  (True, that wouldn't even help if the news
had been passed along untouched over more than one link.  Maybe a protocol
for passing news through untouched while still adding pertinent path
information could be worked out?)
--
 -------------------------------    Disclaimer:  The views contained herein are
|       dan levy | yvel nad      |  my own and are not at all those of my em-
|         an engihacker @        |  ployer or the administrator of any computer
| at&t computer systems division |  upon which I may hack.
|        skokie, illinois        |
 --------------------------------   Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
                                                vax135}!ttrdc!levy
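A sketch in C of the check suggested here, with made-up site names: when an
article arrives from a given feed, see whether that feed already named
itself at the front of the Path, and splice it in (along with the local
name) if it did not.  The helper is hypothetical, not part of any existing
news software, and assumes it is handed an actual Path header line.

    /*
     * Hypothetical relay check: make sure the feed that handed us the
     * article appears as the first hop of the Path line, then add our
     * own name in front as usual.
     */
    #include <stdio.h>
    #include <string.h>

    /* Print a corrected Path line for an article received from `feed`. */
    static void relay_path(const char *pathline, const char *feed,
                           const char *us, FILE *out)
    {
        const char *route = pathline + 6;           /* skip "Path: " */
        size_t flen = strlen(feed);

        if (strncmp(route, feed, flen) == 0 && route[flen] == '!')
            fprintf(out, "Path: %s!%s", us, route);          /* feed named itself */
        else
            fprintf(out, "Path: %s!%s!%s", us, feed, route); /* it did not */
    }

    int main(void)
    {
        /* B passed A's article through untouched; we are C and got it from B. */
        relay_path("Path: A!joeuser\n", "B", "C", stdout);   /* -> Path: C!B!A!joeuser */
        /* B did add itself, so nothing extra is spliced in. */
        relay_path("Path: B!A!joeuser\n", "B", "C", stdout); /* -> Path: C!B!A!joeuser */
        return 0;
    }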
eric@chronon.UUCP (Eric Black) (06/11/86)
In article <2197@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:
>> [... various suggestions to compress article bodies, not headers ...]
> [... discussion of how adaptive code compression works ...]
>
>So, as you can see, you only get "good" compression if the file is long
>enough for the program to see repeated sequences enough times for it to
>build up a code for the longer sequences.  Since the codes start out 9
>bits long, as long as the codes are for single characters you don't get
>any compression at all -- the file actually turns out bigger than it
>started out as.

It appears that the majority of news traffic consists of three basic parts:
1) header information, 2) quoted excerpts (?) of other articles, and 3) new
text.

While the particular header on a particular article may be quite different
on a char-for-char basis than any other, a large number of article headers
taken as a whole would seem to be an excellent candidate for this method of
compression.

There is a tendency for articles to come out in clusters in response to any
given prior article, and the quantity of included quotes is not only higher
than it needs to be for most individual articles, but often the same
referential text is included in a large number of followup articles.  This
also seems to provide good potential for this compression technique,
assuming that articles containing the same text are compressed together.

New text is usually some form of English, however technically-oriented it
may be, and provides some sort of character distribution as might be
expected from English.  This will benefit from compression even if the
exact text appears in only one article.

One solution might be, then, to separate the bodies of news articles from
the headers, and batch and compress them separately.  This brings up all
sorts of reliability and queueing issues (since the text and the control
information are sent separately), but would allow pass-through of news
articles while requiring decompressing, modifying, and recompressing the
headers only.  The relay site can peruse the batched headers to determine
if it is worth decompressing the article bodies for local consumption.

Careful partitioning of the newsgroups into batches could then reduce the
cost ($$ and cycles/disks) to relay sites, so that they may be less
reluctant to continue downstream feeding of newsgroups they don't consume
themselves.  This partitioning is not an easy task; it is not clear that
there is anything that could be agreed upon as an optimal solution.

Not having to recompress passed-through articles (even if they get
decompressed locally) should save some cycles, anyway.
--
Eric Black	"Garbage In, Gospel Out"
UUCP: {sun,pyramid,hplabs,amdcad}!chronon!eric