tneff@bfmny0.BFM.COM (Tom Neff) (07/09/90)
We have recently seen a spate of "source" postings in "uuencoded
compressed TAR" form, instead of SHAR or other traditional plain text
formats.  Now, possibly in response, we are seeing tools to manipulate
this format posted.  This is a bad trend!  Let's not encourage it
further.

The supposed advantage of shipping files this way is that when all the
decoding is finally done on the receiver's machine, you are guaranteed
the exact byte stream that existed on the source machine -- apparently
a very seductive feature for some authors.  But the price for this is
heavy:

 * Readers can no longer easily inspect the source postings BEFORE
   installation, to see if they merit further interest.  Often they
   must spend the time and disk space to unpack everything before
   deciding whether to keep or delete it.  Nor are the usual article
   scanning tools such as rn's '/' and 'g' commands useful.

 * Compressed newsfeeds, which already impart whatever transmission
   efficiency gain LZW can offer, are circumvented and in fact
   sandbagged by the pre-compression of data.

 * Crucial source format conversions such as CR/LF replacement, fixed
   or variable record encoding, ASCII/EBCDIC translation, etc., which
   automatically take place in plain text news/notes postings, are
   again circumvented; users in alien environments are left with raw
   UNIX format bitstreams to deal with.

 * The format presupposes the existence of decoding tools which may or
   may not be present in a given environment.  Non-UNIX users who lack
   some of the automated extraction facilities we take for granted --
   but who can still hand separate a few simple SHAR's into something
   useful -- are left out in the cold.

These objections are not just quibbles -- they cut to the heart of the
question of what a worldwide source text network is supposed to be
about.  News is not mail; news is not a BBS.  The "advantages" of
condensing source postings into gibberish are not worth the drawbacks.

NOTE: When it is occasionally necessary to distribute small,
effectively binary files (i.e., files whose precise bitstream is
important) together with larger "vanilla" source postings, as with a
LaserJet printer manager, then JUST those special files should be
encoded (not compressed) with a simple translator like 'btoa' or
uuencode, and the resulting text included in the otherwise plaintext
archive.
--
Psychoanalysis is the mental illness    \\\ Tom Neff
it purports to cure.  -- Karl Kraus     \\\ tneff@bfmny0.BFM.COM
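For concreteness, the packaging Neff's NOTE describes boils down to a
couple of commands.  This is only a sketch; the filenames are purely
illustrative:

	# Hypothetical LaserJet-manager posting: only the genuinely
	# binary file is encoded (not compressed); everything else
	# stays readable plain text inside an ordinary shar.
	uuencode fontdata.bin fontdata.bin > fontdata.uu
	shar README Makefile ljmgr.c fontdata.uu > ljmgr.shar
	# A recipient unpacks the shar as usual, then runs:
	#     uudecode fontdata.uu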
roy@cs.umn.edu (Roy M. Silvernail) (07/09/90)
tneff@bfmny0.BFM.COM (Tom Neff) writes:

> We have recently seen a spate of "source" postings in "uuencoded
> compressed TAR" form, instead of SHAR or other traditional plain text
> formats.  Now, possibly in response, we are seeing tools to manipulate
> this format posted.  This is a bad trend!  Let's not encourage it
> further.

I agree completely!  I'm DOS-bound, but I was thinking of having a go
at porting the Anonymous Contact Service... unfortunately, the tarfile
is replete with unix filenames that will choke a DOS machine.  I could
mung them manually in a shar.

> * The format presupposes the existence of decoding tools which may
>   or may not be present in a given environment.  Non-UNIX users who
>   lack some of the automated extraction facilities we take for
>   granted -- but who can still hand separate a few simple SHAR's into
>   something useful -- are left out in the cold.

Tools I have... compress, PAX, uu*code... I also have a brain-dead OS
to deal with.

Thanks, Tom, for pointing out the problems with compressed tarfiles.
--
Roy M. Silvernail         | "It won't work... I have an      | Opinions found
now available at:         |  exceptionally large mind."      | herein are mine,
cybrspc!roy@cs.umn.edu    |  --Marvin, the paranoid android  | but you can rent
(cyberspace... be here!)  |                                  | them.
kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (07/10/90)
Some text has been edited out of the following quotes...

In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.

I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm
not entirely happy with it, but I believe there are some valid reasons
for using this ugly format...

> * Readers can no longer easily inspect the source postings BEFORE
>   installation, to see if they merit further interest.

A valid gripe.  I agree with you 100% on this one; it is the main
reason I'm not entirely pleased with uuencoding.  I always describe
the contents of a uuencoded article, in a plain-text paragraph before
the "begin" line of the uuencoded stuff.  I try to make this
description sufficient to allow readers to decide whether or not the
article is worth keeping.

> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

So sites with compressed newsfeeds don't care a whole lot, but those
with uncompressed feeds DO care.  Any sites with little free disk
space also benefit from the compression.

> * Crucial source format conversions such as CR/LF replacement, fixed
>   or variable record encoding, ASCII/EBCDIC translation, etc, which
>   automatically take place in plain text news/notes postings, are
>   again circumvented; users in alien environments are left with
>   raw UNIX format bitstreams to deal with.

But I don't want the network to translate my articles!  When I post an
article, there's a good chance that it will go from a UNIX machine,
through BITNET, to another UNIX machine.  Because it went through
BITNET, it will have been translated from ASCII into EBCDIC and back
into ASCII.  This translation may leave scars: some characters may
have been transliterated incorrectly, long lines may be silently
truncated or split, and whitespace may be changed.  And all of this is
happening on machines that I have no control over!

When I transmit a file, I want it to be received unchanged.  If it
must be translated to suit the receiver's environment, then that
translation should be done explicitly by the receiver, not magically
by some machine halfway between here & there.

> * The format presupposes the existence of decoding tools which may
>   or may not be present in a given environment.

They should be.  People have been posting them, and they're available
at archive sites.

Certainly, when I post an article, I do so because I want to make my
source code available to people.  Anything that limits the
availability should be viewed with a critical eye.  Uudecode and
compress fall into that category.  So does the BITNET protocol.  A
user who lacks uuencode and compress can get them from somewhere.  A
user who has only a BITNET feed is stuck.

If there were no such thing as BITNET then I would probably use shar.

>Psychoanalysis is the mental illness    \\\ Tom Neff
>it purports to cure.  -- Karl Kraus     \\\ tneff@bfmny0.BFM.COM
-------------------------------------------------------------------------------
Steve Kirkendall    kirkenda@cs.pdx.edu    uunet!tektronix!psueea!eecs!kirkenda
chip@tct.uucp (Chip Salzenberg) (07/10/90)
According to kirkenda@eecs.UUCP (Steve Kirkendall):
>So sites with compressed newsfeeds don't care a whole lot, but those
>with uncompressed feeds DO care.

If anyone is insane enough to run an uncompressed newsfeed, then he
deserves what he gets.

>Any sites with little free disk space also benefit from the
>compression.

Sites with little disk space shouldn't be receiving the sources groups.
--
Chip Salzenberg at ComDev/TCT   <chip@tct.uucp>, <uunet!ateng!tct!chip>
doug@letni.UUCP (Doug Davis) (07/10/90)
In article <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm
>not entirely happy with it, but I believe there are some valid reasons
>for using this ugly format...
>I always describe the contents of a uuencoded article, in a plain-text
>paragraph before the "begin" line of the uuencoded stuff.  I try to
>make this description sufficient to allow readers to decide whether or
>not the article is worth keeping.
>
>> * Compressed newsfeeds, which already impart whatever transmission
>>   efficiency gain LZW can offer, are circumvented and in fact
>>   sandbagged by the pre-compression of data.
>
>So sites with compressed newsfeeds don't care a whole lot, but those
>with uncompressed feeds DO care.  Any sites with little free disk
>space also benefit from the compression.

Actually this is a very incorrect assumption; very few newsfeeds any
more are not compressed in some way.  Compressing/uuencoding/etc. a
posting neatly circumvents any compression.  The minimal savings on
disk space doesn't justify doubling the phone time it costs the
article to get to the site.  Disk space is cheap, memory is cheap,
in-line compression is cheap.  However *PHONE TIME* is expensive.  A
lot of usenet is in the dialup world, and extra phone costs that are
needlessly added on are not appreciated.

>But I don't want the network to translate my articles!

Yes you do, unless you're posting binaries (which is another pain).

>When I post an article, there's a good chance that it will go from a
>UNIX machine, through BITNET, to another UNIX machine.  Because it
>went through BITNET, it will have been translated from ASCII into
>EBCDIC and back into ASCII.  This translation may leave scars:

Sites that have this problem, and they are getting rare, are already
dealing with this issue.  Dealing with them by costing the rest of us
more money is not a viable alternative.  Your code needs to be changed
in the bitnet world so it can be used; people know that, and software
is written to do this FOR users at those sites, automagically, so they
don't have to go dredging for utilities for such things.  You have to
expect that the site admins might know what they are doing and are not
blindly allowing software to hack up your postings.

People know how to handle shar's; it's a nice standard for posting
sources.  If you have a binary, or an object that needs to be posted
as well, then by all means compress and uuencode it.  But SHAR that
with your sources and post your package that way.  It makes more sense
and is much more appreciated.

doug
__
Doug Davis/4409 Sarazen/Mesquite Texas, 75150/214-270-9226
{texsun|lawnet|texbell}!letni!doug or doug@letni.lonestar.org
"Be seeing you..."
silvert@cs.dal.ca (Bill Silvert) (07/10/90)
In article <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Some text has been edited out of the following quotes...

ditto

>> * The format presupposes the existence of decoding tools which may
>>   or may not be present in a given environment.
>
>They should be.  People have been posting them, and they're available
>at archive sites.

Not all common tools are "available", in the sense that they can be
recovered from archive sites and recompiled, on all machines.  For
example, I cannot get 16-bit uncompression on my MS-DOS machine, and
the uncompress I ported to my obsolete Unix box is pretty flaky.

I support sticking with shar as much as possible, since all you really
need is a text editor to pull shar archives apart.

The translation problem that Steve refers to could be solved by
including an ASCII table at the head of each shar, so that
transpositions of characters could be identified.  This sort of thing
is done by the Dumas version of uuencode.
--
William Silvert, Habitat Ecology Division, Bedford Inst. of Oceanography
P. O. Box 1006, Dartmouth, Nova Scotia, CANADA B2Y 4A2.  Tel. (902)426-1577
UUCP=..!{uunet|watmath}!dalcs!biomel!bill    BITNET=bill%biomel%dalcs@dalac
InterNet=bill%biomel@cs.dal.ca
tp@mccall.com (07/11/90)
In article <1990Jul10.182546.26487@diku.dk>, thorinn@skinfaxe.diku.dk
(Lars Henrik Mathiesen) writes:
> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>>[Many good reasons not to tar-compress-uuencode source and other
>>plain text in news postings.]
>>
>> * Compressed newsfeeds, which already impart whatever transmission
>>   efficiency gain LZW can offer, are circumvented and in fact
>>   sandbagged by the pre-compression of data.
>
>   name        size     crummy ASCII graphics
>   ----------  -------  ---------------------
>   tar         4718592  tar ------- -60.3% ------> tar.Z
>   tar.Z       1874378          +37.8%     +37.8%
>   tar.Z.uu.Z  2229065  tar.uu.Z ---- -6.8% ----> tar.Z.uu.Z
>
> Of course, compression factors will vary widely; I have made this
> experiment several times, with the same picture emerging: It pays to
> compress before uuencoding, and it pays to compress after, and it
> pays best to do both.

It is true, as you show, that if you have to post uuencoded stuff, you
are better off compressing it first.  That is not the issue being
argued, however.  The issue under discussion is whether it is better
to post a shar file or a uuencoded compressed tar file.  In either
case the data will be compressed during the news feed.  However, your
figures above show that the compressed tar file is smaller than the
compressed uuencoded compressed tar file.  If we assume that a tar
file is about the same size as a shar file, your figures show that
posting uuencoded compressed tar files uses MORE bandwidth than
posting a shar.

I KNOW that this is a whopper of an assumption, but I don't have a
shar program and my tar writer is a royal pain to use.  If someone
would take a large amount of source code (like if you happen to have
rn or nn or B news lying around) and do the following:

1) shar it and compress it and add up the sizes of all the compressed
   parts.

2) tar it, compress it, uuencode it, and compress it again and check
   the size.

If the result of (1) is less than the result of (2), then in addition
to all the other reasons not to post uuencoded compressed tars, we
will find that it ALSO uses more net bandwidth, and thus has NO
redeeming features (assuming all ascii text data).  If, on the other
hand, the result of (2) is less than the result of (1), we will find
that there is a redeeming feature, but I will still contend that it is
a pain in the rear to receive postings in this format.
--
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA
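The measurement Terry proposes reduces to a few pipelines.  A sketch,
assuming a flat source directory "src" and a 16-bit compress (the
names are illustrative, and a real test should sum multi-part shars as
he describes):

	# Method (1): shar the sources, then compress once,
	# as a compressed newsfeed would:
	shar src/* > src.shar
	compress < src.shar | wc -c

	# Method (2): tar, compress, uuencode, then compress again
	# (the second compress stands in for the news batcher):
	tar cf - src | compress | uuencode src.tar.Z > src.uu
	compress < src.uu | wc -c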
thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (07/11/90)
tneff@bfmny0.BFM.COM (Tom Neff) writes:
>[Many good reasons not to tar-compress-uuencode source and other
>plain text in news postings.]
>
> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

That turns out not to be the case.  It is true that a compressed file
will usually expand if it is compressed again.  But the intervening
uuencode is very important: Compressing a uuencoded file is somewhat
independent of compressing the original (*).  I made an experiment
with a tar of a directory tree with mixed source, binaries, and
images.

  name        size          crummy ASCII graphics
  ----------  -------       ---------------------
  tar         4718592    tar ------- -60.3% ------> tar.Z
                          |                          |
  tar.Z       1874378     | +37.8%           +37.8% |
                          |                          |
  tar.uu      6501192     V                          V
                         tar.uu ----- -60.3% -----> tar.Z.uu
  tar.Z.uu    2582500     |                          |
                          | -63.2%           -13.7% |
  tar.uu.Z    2392701     |                          |
                          V                          V
  tar.Z.uu.Z  2229065    tar.uu.Z ---- -6.8% -----> tar.Z.uu.Z

Of course, compression factors will vary widely; I have made this
experiment several times, with the same picture emerging: It pays to
compress before uuencoding, and it pays to compress after, and it pays
best to do both.

In words: If you have to post uuencoded stuff (tar archives, images,
whatever), COMPRESS them first.  It is always better: In terms of
storage on intermediate nodes and of transmission on non-compressed
links it is very much better; it may not save much on compressed
links, but it doesn't hurt (contrary to common assertions), and the
small saving may still pay for the cost to run compress (and compress
has less data to process, anyway, so it doesn't run for so long).

I wish this misconception about the badness of compressed uuencoded
data on compressed news links would go away; anyone for a news.config
FAQ posting?
______________________________________________________________________
(*) An attempt at an explanation: The uuencode process maps the source
bytes into a smaller set (64 symbols), and it maps three source bytes
into four and puts in newlines.  Compress works by finding common byte
sequences and mapping them into symbols.  A common source sequence
will occur in three different ``phases'' after uuencode, and may be
broken by newlines, so compress will not find it as easily.  Of
course, long sequences of identical bytes, as often in images, are
immune to the shift effect.

On the other hand, a 16-bit compress should be able to map all the
2-symbol uuencode sequences and about one fourth of the 3-symbol ones
into a 16-bit symbol, giving a compression of about 12% on the
uuencode of a totally random byte sequence.  (Running compress after
compress-uuencode usually gives between 11% and 14% compression,
bearing this out; for this purpose, the first compress effectively
gives a random sequence.)  So: compress may get more of the
``available compression'' in a given input if it is run before
uuencode.  On the other hand, compress will be able to undo some of
the expansion caused by uuencode, masking the first effect.
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark  [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.  thorinn@diku.dk
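The pipelines behind the figures above are easy to reproduce.  A
sketch, with "x.tar" standing in for any archive:

	compress < x.tar      > x.tar.Z       # compress first...
	uuencode x.tar.Z < x.tar.Z > x.tar.Z.uu   # ...then encode...
	compress < x.tar.Z.uu > x.tar.Z.uu.Z  # ...then the batcher's pass
	uuencode x.tar < x.tar > x.tar.uu     # the non-precompressed route
	compress < x.tar.uu   > x.tar.uu.Z
	wc -c x.tar x.tar.Z x.tar.uu x.tar.Z.uu x.tar.uu.Z x.tar.Z.uu.Z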
amos@taux01.nsc.com (Amos Shapir) (07/11/90)
In article <3114@psueea.UUCP> you write:
|> * Crucial source format conversions such as CR/LF replacement, fixed
|>   or variable record encoding, ASCII/EBCDIC translation, etc, which
|>   automatically take place in plain text news/notes postings, are
|>   again circumvented; users in alien environments are left with
|>   raw UNIX format bitstreams to deal with.
|
|But I don't want the network to translate my articles!  When I post an
|article, there's a good chance that it will go from a UNIX machine,
|through BITNET, to another UNIX machine.  Because it went through
|BITNET, it will have been translated from ASCII into EBCDIC and back
|into ASCII.  This translation may leave scars: some characters may
|have been transliterated incorrectly, long lines may be silently
|truncated or split, and whitespace may be changed.  And all of this is
|happening on machines that I have no control over!

The point was, what happens to those who use the BITNET/EBCDIC
machines?  The fact is, they are more numerous than UNIX users.

|Certainly, when I post an article, I do so because I want to make my
|source code available to people.  Anything that limits the
|availability should be viewed with a critical eye.  Uudecode and
|compress fall into that category.  So does the BITNET protocol.  A
|user who lacks uuencode and compress can get them from somewhere.  A
|user who has only a BITNET feed is stuck.

Precisely for this reason, it is better to write your source in a way
that wouldn't be so sensitive to such changes.
--
Amos Shapir    amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522408  TWX: 33691, fax: +972-52-558322
GEO: 34 48 E / 32 10 N
woods@eci386.uucp (Greg A. Woods) (07/11/90)
In article <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
> In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
> >We have recently seen a spate of "source" postings in "uuencoded
> >compressed TAR" form, instead of SHAR or other traditional plain text
> >formats.
>
> I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm
> not entirely happy with it, but I believe there are some valid reasons
> for using this ugly format...

ARGH!  ALL YOUR REASONS WERE PREVIOUSLY INVALIDATED BY TOM!  Did you
read what he wrote?  Did you understand it?  IMHO there are *NO* valid
reasons for using this ugly format, and you certainly didn't uncover
any in your article!

> So sites with compressed newsfeeds don't care a whole lot, but those
> with uncompressed feeds DO care.  Any sites with little free disk
> space also benefit from the compression.

Sites without compressed newsfeeds know what they are getting, and
have chosen to do it this way.  You cannot presuppose that you are
giving them a helping hand by compressing your postings beforehand.
Are you trying to tell me that you can justify wasting the bandwidth
caused by the failure of compress on large batches, between the many
normal sites, for the very few who have chosen not to use compress for
some arcane reason?

Sites with little free disk space will *not* gain appreciably from a
few obscure people compressing and uuencoding their postings.  If disk
space is that tight, those sites will have other, much more efficient
ways of dealing with the problem.  In fact, these sites may actually
be impacted greatly by such postings, as every news reader who finds
interest in the article may take it upon himself to un-pack his own
private copy to see what it's all about!

> > * Crucial source format conversions such as CR/LF replacement, fixed
> >   or variable record encoding, ASCII/EBCDIC translation, etc, which
> >   automatically take place in plain text news/notes postings, are
> >   again circumvented; users in alien environments are left with
> >   raw UNIX format bitstreams to deal with.
>
> But I don't want the network to translate my articles!
>[....]
> When I transmit a file, I want it to be received unchanged.  If it
> must be translated to suit the receiver's environment, then that
> translation should be done explicitly by the receiver, not magically
> by some machine halfway between here & there.

That is usually the case, and what Tom was referring to.  It is rare
to find sites mangling news which is only passing through these days.
The translation usually occurs either during the storing of news, or
in the retrieval by the newsreader.  Meanwhile, your ugly format has
destroyed the automatic translation capabilities of those sites which
need it, and forced the end users to individually convert your
postings by hand (or with the help of one of the tools Steve referred
to, provided they can be made to work in the environment in question).

If your code is so gross as to contain escape sequences, or 8-bit
data, then you deserve the "conversion"!  :-)  It will remain the case
for quite some time that any files containing anything but the 96
printable characters in the ASCII set will be subject to change, even
on UNIX to UNIX links, both with mail and with news.  Since anything
but the 96 printable chars isn't printable by definition, how could it
be source in the first place?  If your files are riddled with such
"garbage", please feel free to uuencode them, but please post them to
an arbitrary binaries group.

> > * The format presupposes the existence of decoding tools which may
> >   or may not be present in a given environment.
>
> They should be.  People have been posting them, and they're available
> at archive sites.

Just because they exist for UNIX environments does not mean they are
available at all sites, nor that they are available for other
environments, nor that the end user can build and install them.

> A user who lacks uuencode and compress can get
> them from somewhere.

That's the attitude which has caused much frustration to new UNIX
users.  So many people have been turned off UNIX and usenet because
some non-thinking guru said it was easy to snarf something fancy from
some far remote site, and port it.  Meanwhile the new fellow still
hasn't got his modem working well!

> If there were no such thing as BITNET then I would probably use shar.

Shar neither helps, nor hinders, with transmission through BITNET.

Finally, I'm curious just how many BITNET sites are in the top 1000
that Brian Reid posts.  What is the total percentage of usenet traffic
which flows through them?  Are any of them currently mangling news
they pass through?
--
Greg A. Woods    woods@{eci386,gate,robohack,ontmoh,tmsoft}.UUCP
+1-416-443-1734 [h]  +1-416-595-5425 [w]  VE3-TCP  Toronto, Ontario CANADA
csu@alembic.acs.com (Dave Mack) (07/11/90)
As the culprit in one of the more recent crimes of this nature, I
suppose I should answer this.

In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.  Now, possibly in response, we are seeing tools to manipulate
>this format posted.  This is a bad trend!  Let's not encourage it
>further.
>
>The supposed advantage of shipping files this way is that when all the
>decoding is finally done on the receiver's machine, you are guaranteed
>the exact byte stream that existed on the source machine -- apparently
>a very seductive feature for some authors.  But the price for this is
>heavy:

The supposed advantage in the case of the Anonymous Contact Service
software which I recently posted to alt.sources is that the uuencoded
compressed tar file was 135K, whereas the corresponding shar file is
235K.  Also, my version of shar3.24 died horribly when presented with
a directory tree.  (I now have Rich Salz' cshar kit, including
makekit, which solves almost all my problems, except that it insists
on putting the README in the second kit.)

> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

Drivel.  See above.  I sincerely doubt that recompressing a uuencoded
compressed file expands it significantly beyond the overhead already
added by uuencode.  Sending shar files costs additional disk space,
and quite a few news links use 12-bit rather than 16-bit compression.

However, since the consensus on the net seems to be that the available
transmission bandwidth and disk storage space are both unlimited, my
next release of the ACS will be in the form of shar files.  As an
added bonus, all of the filenames will be under 14 characters in this
one.  I cannot, however, guarantee that the README will be in Part01.

Dave Mack
embittered idealist, net.scum, villain, and commercial
abuser of the net for over three days.
drd@siia.mv.com (David Dick) (07/11/90)
In <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.  Now, possibly in response, we are seeing tools to manipulate
>this format posted.  This is a bad trend!  Let's not encourage it
>further.

> But the price for this is heavy:
> [ list of significant reasons omitted ]

This is the biggie!

> * The format presupposes the existence of decoding tools which may
>   or may not be present in a given environment.  Non-UNIX users who
>   lack some of the automated extraction facilities we take for
>   granted -- but who can still hand separate a few simple SHAR's into
>   something useful -- are left out in the cold.

>These objections are not just quibbles -- they cut to the heart of the
>question of what a worldwide source text network is supposed to be
>about.  News is not mail; news is not a BBS.  The "advantages" of
>condensing source postings into gibberish are not worth the drawbacks.

As the net expands to encompass a larger and more diverse audience,
familiarity with arcane encoding methods becomes rarefied.  The whole
point of the original "shar" was that it only assumed a shell and a
few commands which everyone on the fledgling USENET had.  Even among
the cognoscenti, propagation of new tools and processing methods is
not guaranteed; the time pressures of a job can interfere!

I think making a significant change in distribution procedures for the
benefit of one adjunct to USENET (BITNET), to the disadvantage of much
of the rest of the net, is a bad idea.

David Dick
Software Innovations, Inc. [the Software Moving Company (sm)]
dave@galaxia.Newport.RI.US (David H. Brierley) (07/11/90)
In article <sean.647630062@s.ms.uky.edu> sean@ms.uky.edu (Sean Casey) writes:
>Compressing an article reduces phone time.  If compress finds a file
>is bigger after compression, it doesn't compress it.  So the phone
>costs really aren't increased by users doing their own compression.

Almost, but not quite.  If you run the compress program by typing
"compress filename" and the resulting compressed file is bigger than
the original, then compress will save the original and delete the
compressed file.  On the other hand, if you type "compress <f1 >f2"
then the compress program will happily create an output file which is
larger than the input file.  Since news uses the compress program in a
pipeline, this is essentially what happens.  Here is an example:

	-rw-rw----  1 dave  family  73150 May 23 17:32 spool.sum
	compress <spool.sum >test1.Z
	-rw-rw----  1 dave  family  15791 Jul 11 09:40 test1.Z
	compress <test1.Z >test2.Z
	-rw-rw----  1 dave  family  23099 Jul 11 09:41 test2.Z

As you can see, test2.Z is 46% larger than test1.Z.  This was done
using full 16-bit compression.  If you use 12-bit compression, which a
lot of sites are using, the results are even worse.
--
David H. Brierley
Home: dave@galaxia.Newport.RI.US   {rayssd,xanth,att}!galaxia!dave
Work: dhb@quahog.ssd.ray.com       {uunet,sun,att,uiucdcs}!rayssd!dhb
mlord@bwdls58.bnr.ca (Mark Lord) (07/11/90)
In article <1990Jul10.160257.24183@cs.dal.ca> bill%biomel@cs.dal.ca writes:
>Not all common tools are "available", in the sense that they can be
>recovered from archive sites and recompiled, on all machines.  For
>example, I cannot get 16-bit uncompression on my MS-DOS machine, and
>the uncompress I ported to my obsolete Unix box is pretty flaky.

How strange...  I have no fewer than three (3) independent 16-bit
uncompress programs on my MSDOS machine, all of which were easy to
obtain from SIMTEL20.  Two of them even handle the older 12-bit style
as well.  Also, I have two versions of programs to handle tar files,
and a multitude of UU/XX decoders, again, all from SIMTEL20.

If you can send/receive email, you can get them from a LISTSERV near
you.
dave@galaxia.Newport.RI.US (David H. Brierley) (07/11/90)
In article <1990Jul10.182546.26487@diku.dk> thorinn@skinfaxe.diku.dk
(Lars Henrik Mathiesen) writes:
>   name        size     crummy ASCII graphics
>   ----------  -------  ---------------------
>   tar         4718592  tar ------- -60.3% ------> tar.Z
>   tar.Z       1874378          +37.8%     +37.8%
>   tar.Z.uu.Z  2229065  tar.uu.Z ---- -6.8% ----> tar.Z.uu.Z

Several points I would like to make.

1) The compressed-uuencoded-compressed file is almost 20% larger than
the compressed file, therefore you have *increased* my phone bills by
20%.  I do not exactly appreciate this.

2) You have increased both the amount of disk space and the time
required for me to determine if this program is useful to me.  First I
have to uudecode it, then I have to uncompress it, and then I have to
un-tar it.  Each of these steps requires disk space and time.  With a
shar posting I can read the entire source before I even save it into
my directory.  I can also unpack a multi-part shar file one piece at a
time and then remove the piece that I just un-shar'ed, thus greatly
reducing the disk space requirements.

3) "tar" format is a lot less portable than "shar" format.  With a
shar file I can edit the file names if they are too long for my System
V based machine.  Try doing that with a tar file.  "shar" format can
be unpacked on a lot of different systems other than just UNIX.
People these days are using your programs in ways you never envisioned
and on systems you never envisioned.  Even if a program is not really
applicable to a particular environment, there are often portions of
the program that can be borrowed and used in other applications.

4) With a "shar" format posting I can decide if something is useful
before I have all of the pieces.  If I then miss one or more pieces I
can request them from somewhere, knowing that they are useful to me.
With a uuencoded tar file I need to have all of the pieces before I
can decide if it is really useful to me.  I know some people will say
"but I precede the posting with a description of what it is", but this
is not good enough.  Unless the description you post exactly matches
the description I have been thinking of for something that I want, I
can't really tell if this will be useful to me.  There is no
substitute for reading through all of the documentation supplied and
reading through a good portion of the source code.  Besides that, what
if the one piece of your posting that I miss is the first one, and
therefore I never see your description of what it is?

In my opinion there are just too many arguments against posting
uuencoded tar files to even consider it as a viable alternative to
shar files.  The only reason I can see for uuencoding something is if
it is a binary or if it contains binary data.  Even then you should
just uuencode that one item and include it in a shar file with the
plain text documentation.

Please do not post uuencoded tar files!  If you are concerned about
your program being modified as it is transmitted through BITNET then
make sure your source code is portable enough to withstand this.  You
could also try including checksums in your postings using the "snefru"
package that was posted recently.
--
David H. Brierley
Home: dave@galaxia.Newport.RI.US   {rayssd,xanth,att}!galaxia!dave
Work: dhb@quahog.ssd.ray.com       {uunet,sun,att,uiucdcs}!rayssd!dhb
overby@plains.UUCP (Glen Overby) (07/11/90)
In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) charges:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.

to which <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall)
confesses:
> I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm
> not entirely happy with it, but I believe there are some valid
> reasons for using this ugly format...
> When I transmit a file, I want it to be received unchanged.  If it
> must be translated to suit the receiver's environment, then that
> translation should be done explicitly by the receiver, not magically
> by some machine halfway between here & there.

Both Steve and I are Frequent Flamers on comp.os.minix.  This group is
gatewayed to a LISTSERV list on that bastion of computer networks,
Bitnet.  I think we've all heard the rhetoric about what Bitnet does
to source files, but if not, just ask any one of us who have been
unfortunate enough to have once been a BITNaut.  Every time someone
posts a source file which is not uuencoded, they get flamed by a dozen
BITNauts who feel ripped off for not having gotten a good copy.

But in article <1990Jul10.203015.27282@eci386.uucp> woods@eci386.UUCP
(Greg A. Woods) claims:
> [...]  It is rare
> to find sites mangling news which is only passing through these days.
> The translation usually occurs either during the storing of news, or
> in the retrieval by the newsreader.

I recall five parts of a Minix upgrade being munged last Christmas Eve
(yes, 1989) between vu.nl and nodak.edu, and I think all of its path
was over the Internet.

The rationalization for compressing is to compensate for the expansion
caused by uuencoding, whose rationalization is, in an acronym, BITNET.
Fix the Bitnet problem, and you rid the world of most of the reasons
for uuencoding.

I offer one suggestion: for groups which are source-only, have the
gateway program pump everything thru 'compress | uuencode' before
feeding it to Listserv.  I still see no solution for discussion groups
which also get sources posted to them.

While I'm indicting comp.os.minix, I'd like to also charge
comp.binaries.* with a similar offense, using arc, zip or zoo instead
of compress.

Other solutions, anyone?
--
Glen Overby	<overby@plains.nodak.edu>
uunet!plains!overby (UUCP)  overby@plains (Bitnet)
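Glen's gateway suggestion amounts to a one-line filter on the
news-to-Listserv path.  A sketch only; the surrounding gateway
plumbing and the argument convention are assumed:

	#! /bin/sh
	# Hypothetical gateway filter for a source-only group: the
	# article body arrives on stdin and leaves compressed and
	# uuencoded.  "$1" is the filename the receiving uudecode
	# will recreate.
	compress | uuencode "$1"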
gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (07/12/90)
In article <987@galaxia.Newport.RI.US> dave@galaxia.Newport.RI.US
(David H. Brierley) writes:
>If you are concerned about your
>program being modified as it is transmitted through BITNET then make
>sure your source code is portable enough to withstand this.

This isn't easy, especially if you are distributing patches.  This is
why comp.os.minix distributes patches as .tar.Z.uu's.
--
"Perhaps I'm commenting a bit cynically, but I think I'm qualified to."
                                                       - Dan Bernstein
roy@cs.umn.edu (Roy M. Silvernail) (07/12/90)
overby@plains.UUCP (Glen Overby) writes:

> While I'm indicting comp.os.minix, I'd like to also charge
> comp.binaries.* with a similar offense, using arc, zip or zoo instead
> of compress.

In defense of c.b.ibm.pc, the use of arc/zip/zoo is appropriate, since
these are widely available archivers for MS-DOS platforms.  The
contents of c.b.i.p are binaries _for_ MS-DOS platforms.  Frequently,
these postings are collections of files, rather than single files.  To
use compress, you would also have to devise a way to assemble
multi-file postings, and they would _still_ have to be uuencoded.
(shars of uuencoded compressed files?  yuck!)  Certainly, compress for
MS-DOS machines is available, but arc/zip/zoo are more appropriate.
Also, all 3 formats can be unpacked on a Unix box (and arc and zoo
files can be assembled, as well), so non-DOS types can peek at things
they cannot use.

There is a distinct difference between source postings and binary
postings.  Binaries should properly be packed in a manner appropriate
to the target platform, and sources should be left as transparent as
possible (i.e. shars).

Just my $0.022, adjusted for inflation.
--
Roy M. Silvernail         | "It won't work... I have an      | Opinions found
now available at:         |  exceptionally large mind."      | herein are mine,
cybrspc!roy@cs.umn.edu    |  --Marvin, the paranoid android  | but you can rent
(cyberspace... be here!)  |                                  | them.
flee@guardian.cs.psu.edu (Felix Lee) (07/12/90)
On a different note,

    A group I worked in made the surprising discovery that uuencode, a
    utility traditionally used to convert binary files to a printable
    form to pass through mailers, is a utility to "encode a binary
    file into a different binary file."

	[Randall Howard <rand@mks.com>
	 usenet <620@longway.TIC.COM> comp.std.unix, 4 Apr 1990]

Sending uuencoded files through BITNET is by no means safe.  One
common munge is stripping trailing blanks.  This, at least, is
relatively easy to recover from.
--
Felix Lee	flee@cs.psu.edu
jim@anacom1.UUCP (Jim Bacon) (07/13/90)
In article <5256@plains.UUCP> overby@plains.UUCP (Glen Overby) writes:
>In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) charges:
>>We have recently seen a spate of "source" postings in "uuencoded
>>compressed TAR" form, instead of SHAR or other traditional plain text
>>formats.
> [stuff deleted]
>
>to which <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) confesses:
>> I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm
>> not entirely happy with it, but I believe there are some valid
>> reasons for using this ugly format...
> [stuff deleted]
>
>I offer one suggestion: for groups which are source-only, have the
>gateway program pump everything thru 'compress | uuencode' before
>feeding it to Listserv.  I still see no solution for discussion groups
>which also get sources posted to them.
>
>While I'm indicting comp.os.minix, I'd like to also charge
>comp.binaries.* with a similar offense, using arc, zip or zoo instead
>of compress.
>
>Other solutions, anyone?

I have been involved with FIDOnet on MSDOS for a good number of years
and have suffered the impact of changing "standards" for compression
methods.  At the start, ARC was the standard.  Then PKARC came along.
That wasn't much of a problem, but did cause some confusion.  Then the
lawsuits started flying and we were buried under a slew of new
programs: ZIP, ZOO, and a half dozen others.  Now I never know what to
expect thru the network, and about half of my mail gets lost because I
have taken the position that ARC is the standard on my machine.

I would strongly urge that only a single compression utility be used
as a standard, and for UN*X I would suggest compress.
--
Jim Bacon                 | "A computer's attention span is only
Anacom General Corp., CA  |  as long as its extension cord."
jim@anacom1.cpd.com       |
zardoz!anacom1!jim        |                                  Anon
kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (07/13/90)
In article <5256@plains.UUCP> overby@plains.UUCP (Glen Overby) writes:
>In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) charges:
>>We have recently seen a spate of "source" postings in "uuencoded
>>compressed TAR" form, instead of SHAR or other traditional plain text
>>formats.
>
>Every time someone posts a source file which is not uuencoded, they
>get flamed by a dozen BITNauts who feel ripped off for not having
>gotten a good copy.
>
>I offer one suggestion: for groups which are source-only, have the
>gateway program pump everything thru 'compress | uuencode' before
>feeding it to Listserv.  I still see no solution for discussion groups
>which also get sources posted to them.
>
>Other solutions, anyone?

Here's an idea: Let's compromise!  Come up with a format that really
works!  We should be able to come up with a protocol that combines the
safety of uuencoding with the readability of shar archives.  Some
features I would like to see in the ultimate USENET archive format
are:

1) The archive should be plain-text.  That is, each text file in the
   archive should be easy to locate within the archive, and it should
   be readable without the need to extract it.

2) The format would only be used to combine several text files into a
   single text file.  If you really must include a non-text file, then
   uuencode that one file.

3) Archives should begin with a table of all printable ASCII
   characters, so we can tell when transliteration has gone awry.
   (A sketch of such a prologue follows this article.)

4) The archive program should split long lines when the archive is
   created, and rejoin them during extraction.

5) Tabs should be expanded to spaces.  The extraction program should
   convert groups of spaces back into tabs.

6) The program that creates the archive should give a warning message
   when a file's whitespace is likely to be reformatted.  For example,
   spaces at the end of a line are a no-no.

7) The extraction program should be clever enough to ignore news
   headers and other introductory text, just for the sake of
   convenience.

8) It should be possible to embed one archive inside another.  This
   ability probably wouldn't see much use, but lack of the ability
   could sure be a nasty surprise to somebody.  "What?  You mean it
   only works on *some* text files?"

9) Should we use trigraphs for some of the more troublesome ASCII
   characters?  The extraction utility could convert them back into
   real characters.

Did I miss anything?  Did I get anything wrong?  Does anybody know of
an existing format that comes close to these specs?
-------------------------------------------------------------------------------
Steve Kirkendall    kirkenda@cs.pdx.edu    uunet!tektronix!psueea!eecs!kirkenda
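Point 3 of the list above is cheap to implement.  A sketch of such a
character-set prologue; the "#CHARSET" tag is invented here purely for
illustration:

	# Prepend one line containing every printable ASCII character;
	# an extractor compares it against its own idea of that
	# sequence and warns about any character that was
	# transliterated in transit.
	cat <<'EOF'
	#CHARSET !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
	EOF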
thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (07/13/90)
dave@galaxia.Newport.RI.US (David H. Brierley) writes:
>In article <1990Jul10.182546.26487@diku.dk> thorinn@skinfaxe.diku.dk
>(Lars Henrik Mathiesen) writes:
>>   name        size     crummy ASCII graphics
>>   ----------  -------  ---------------------
>>   tar         4718592  tar ------- -60.3% ------> tar.Z
>>   tar.Z       1874378          +37.8%     +37.8%
>>   tar.Z.uu.Z  2229065  tar.uu.Z ---- -6.8% ----> tar.Z.uu.Z

>1) The compressed-uuencoded-compressed file is almost 20% larger than
>the compressed file, therefore you have *increased* my phone bills by
>20%.  I do not exactly appreciate this.

1) As I wrote, IF you have to post uuencoded material, it should
probably be compressed first.  I also wrote that I agree with all the
other reasons the original poster gave to AVOID posting uuencoded
stuff.  I'm not advocating that people waste your bandwidth by
uuencoding stuff; I'm trying to prevent a mistaken argument from
making people always post uuencoded stuff non-compressed --- because
that often uses even more bandwidth, and almost always uses much more
disk space.

Compressing before uuencoding often saves 60% on disk and 5-10% on the
wire --- but sometimes it will only save ~5% on disk and _waste_ ~20%
on a compressed link (some Sun run-length-encoded rasterfiles behave
that way).  The poster should try to find out how each of his files
behaves, and pack each of them in the cheapest way; as ``bandwidth on
compressed links'' seems to be the most popular cost metric, cheapest
probably means ``smallest after compression''.  And then make a shar
archive of the packed files, so people can decide which they want to
unpack.

Another problem with this: The result of compressing a single file may
be very misleading when we really want to know how much larger it
makes a compressed batch of news articles.  Compress is a very
stateful representation, and in a given batch it may not be able to
compress a uuencoded file nearly as much as when taken alone.  So even
the worst rasterfile example may not affect the size of a batch as
much as the numbers lead one to believe.  (Normally, compress gets
~13% after any uuencode; in these examples, it gets ~30% after
uuencode, but only the usual ~13% after compress-uuencode.  In the
middle of a batch, the difference might shrink a lot --- possibly to
the point where compress-uuencode wins again because it starts out 5%
smaller.)

2) I hope you realize that a tar archive has binary file headers and
cannot be posted without some sort of encoding, so your 20% are not
immediately applicable.  However, anybody who uuencodes something
which would have got through news as well without encoding deserves
your scorn and anger (and in my opinion, this includes anybody who
posts a tar archive consisting of ASCII files).

And I don't understand why ASCII/EBCDIC problems should be an excuse
for uuencode, either.  The format uses the ASCII characters '!', '['
and ']', which are among those I've most often seen altered in
ASCII->EBCDIC->ASCII translations.  If a uuencoded file gets through
unscathed, odds are that any printable ASCII file would.  But maybe
somebody wrote a uudecode which takes input in EBCDIC and outputs in
ASCII?
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark  [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.  thorinn@diku.dk
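Lars's batch-context point is easy to check directly.  A sketch, with
hypothetical article files:

	# Compare what a uuencoded posting really adds to a
	# compressed batch, rather than measuring it alone:
	cat article1 article2          > batch.without
	cat article1 stuff.uu article2 > batch.with
	compress < batch.without | wc -c
	compress < batch.with    | wc -c
	# The difference between the two counts, not the size of
	# stuff.uu.Z by itself, is the cost on a compressed link.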
tale@cs.rpi.edu (David C Lawrence) (07/13/90)
In <3124@psueea.UUCP> kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) writes:
5) Tabs should be expanded to spaces. The extraction program should convert
groups of spaces back into tabs.
... only if that is from whence they came. I extremely rarely use
tabs. Hate 'em, in fact. You'd have a different copy of the sources
if you just changed all groups of spaces back to tabs based on some
pre-conceived notion of what a tab width is ("8 spaces" is not always
the right answer). In some cases this could be VERY important if you
did it inside some literal that was important to the code. (In the
cases of patches it is not as important because patch(1) does have a
flag to ignore these sorts of differences when checking to see that an
update is right, and you could always warn people to use it.)
6) The program that creates the archive should give a warning message when
a file's whitespace is likely to be reformatted.  For example, spaces at
the end of a line are a no-no.
I don't think this adequately addresses the above concern.
--
(setq mail '("tale@cs.rpi.edu" "tale@ai.mit.edu" "tale@rpitsmts.bitnet"))
woods@robohack.UUCP (Greg A. Woods) (07/13/90)
In article <5256@plains.UUCP> overby@plains.UUCP (Glen Overby) writes:
> The rationalization for compressing is to compensate for the expansion
> caused by uuencoding, whose rationalization is, in an acronym, BITNET.
> Fix the Bitnet problem, and you rid the world of most of the reasons
> for uuencoding.

I would suggest that the BITNauts (as you so aptly called them!),
since they are the most common complainers, and the ones who are
affected first, should be the best ones to lobby the powers that be in
the BITNET towers of power.

It shouldn't really be that hard.  I'd bet the biggest problem is the
lack of co-ordination between versions of software on each end of the
link.  A little co-operative administration, and we wouldn't be
discussing all this nonsense!
--
Greg A. Woods    woods@{robohack,gate,eci386,tmsoft,ontmoh}.UUCP
+1 416 443-1734 [h]  +1 416 595-5425 [w]  VE3-TCP  Toronto, Ontario; CANADA
bengtl@maths.lth.se (Bengt Larsson) (07/13/90)
In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:

(a list of valid points concerning a "shar" program)

>Did I miss anything?  Did I get anything wrong?  Does anybody know of
>an existing format that comes close to these specs?

Hmmm, one way to do it would be to write a little "unpacker" program
(in C), and distribute it with the archive (in plain text).

Suggested format for archive (borrowed heavily from VMS_SHAR, the
"shar" program for VMS):

(unpacker program (optional) in plain text here.  Let's call it
"unpacker.c".  For those who don't have it: extract it from here,
compile it with "cc -o unpacker unpacker.c", and start unpacking)

-- start part 1 --
file packer.txt 744 23642334
X The filename is on a line started by "file", followed by one space,
X followed by the filename, a space, a (Unix) protection code (octal,
X like for "uuencode") and a checksum.  The filename must not contain
X a space.
X
X The archived file is mostly normal text.  Control characters are
X escaped with a backtick followed by three characters with the decimal
X (octal?) value of the escaped character.  Like `009 for a tab.  The
X backtick is itself escaped like `096.
X
X Long lines are folded.  Normal lines start with an "X".  Continuation
V lines (like this one) start with a "V" (that is, newlines are to be
V skipped before "V").
X
X Since all lines start with a special character, it is possible to
X archive archives (the archived file ends with a line not starting
X with "X" or "V").
X
X Trailing blanks are escaped, just like control characters.
X Trailing blanks which result from splitting a long line are also
X escaped.  When run through the unpacker, all trailing spaces are
X stripped first (trailing blanks may have been added somewhere).
X
X This is a line with some trailing spaces...  `032
-- end part 1 --

Anything may come here (news headers, for example).  We start the next
part with a line which starts with "-- start part 2".  Note that the
headers etc. may be in the middle of a file.  All parts in the archive
have the same length.  Archived files are split routinely between
parts.

All the unpacker has to do is to look for a line starting with
"-- end part xx" and then skip to a line beginning with
"-- start part xx+1".  The unpacker may (should?) check that the "xx"
numbers are correct and in sequence.

-- start part 2 --
X
X Now we can say something about directories.  Let's start a new file
X "subdir.txt" in a subdirectory "doc".
directory doc 744
file doc/subdir.txt 744 2353453
X
X Now we are in the subdirectory.  A directory is created by a line
X started by "directory".  The subdirectory may already exist (that is
X no error).  Anyway, the protection code is specified like for files.
X
X When files in a subdirectory are specified, directory parts are
X separated by "/" (like in Unix).  This should make it possible to
X write unpackers for other environments (for example VMS).
X
X Let's say that the archive should be terminated with a line
X "end archive".
X
-- end part 2 --
end archive

The unpacking program could be run like:

	% unpacker prog.pck.01 prog.pck.02 prog.pck.03 ...

or (Unix):

	% cat prog.pck.?? | unpacker

What do you think?  The idea was that the "packer" program may be
somewhat complex, but the "unpacker" should be small (could be
distributed with the archive in plain text).  The "packer" could
accept lots of options (for example, which characters to escape, the
maximum line length, the maximum part size, maybe maximum length for
filenames, etc.).  Reasonable defaults should be provided.
I think the "packer" should default to the "safest" format (escaping tabs and special characters for Bitnet). If the escaping mechanism is turned off, this is just a file splitter/extractor (may be used to split uuencoded GIF files, for example :-) Bengt Larsson. -- Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden Internet: bengtl@maths.lth.se SUNET: TYCHE::BENGT_L
gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) (07/13/90)
In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Here's an idea: Let's compromise!  Come up with a format that really
>works!

Ok, let's set a goal: the files should unpack totally unchanged.

>3) Archives should begin with a table of all printable ASCII
>   characters, so we can tell when transliteration has gone awry.

This is good, although if two characters are mapped into one, the
process has failed.  Formats like btoa avoid this because not many
computers map several English alphabetic characters into one.

>5) Tabs should be expanded to spaces.  The extraction program should
>   convert groups of spaces back into tabs.

How can you do this and preserve the original file?  What if the
original file has a bunch of spaces in a row?

>6) The program that creates the archive should give a warning message
>   when a file's whitespace is likely to be reformatted.  For example,
>   spaces at the end of a line are a no-no.

They may be a no-no, but what if you're transmitting context diffs on
files which used to have excess spaces... you could presumably think
up other situations in which you'd want trailing spaces to be
preserved.

>9) Should we use trigraphs for some of the more troublesome ASCII
>   characters?  The extraction utility could convert them back into
>   real characters.

You'd definitely need some way of signaling special characters.  Then
you could mark tabs and fake ends-of-lines in order to prevent spaces
from getting eaten.  But then we'd have to figure out every single
special character that is at risk for being munged -- {} [] $ | \
By the time you get done, it might be rather hard to read.
--
"Perhaps I'm commenting a bit cynically, but I think I'm qualified to."
                                                       - Dan Bernstein
tp@mccall.com (07/13/90)
In article <1990Jul13.022224.25441@lth.se>, bengtl@maths.lth.se (Bengt
Larsson) writes:
> In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>
> (a list of valid points concerning a "shar" program)
>
>>Did I miss anything?  Did I get anything wrong?  Does anybody know of
>>an existing format that comes close to these specs?
>
> Hmmm, one way to do it would be to write a little "unpacker" program
> (in C), and distribute it with the archive (in plain text).

It had better be a VERY portable program!

> Suggested format for archive (borrowed heavily from VMS_SHAR, the
> "shar" program for VMS):

I agree that the VMS_SHARE format is quite good.  However, the
VMS_SHARE format is self-unpacking on VMS, just like a unix shar is on
unix, with no tools that are not part of the OS.  The VMS_SHARE is a
DCL command procedure that contains a TPU program (TPU is the VMS
programmable text editor) to unpack the files.

The problem with a C program is that it is VERY hard to write a
portable program with no #ifdef's that will do the job.  If you go
this route, write it strictly as a filter, and invoke it just like you
do sed in current shar's, with the <<'EOF' input specifier and the >
redirection for output.  And for heaven's sake, ONLY WRITE ONE OF
THEM, and use only one name for it!  Whatever you write, I have to
recognize explicitly (I maintain a VMS unshar program, and I'm not
interested in making it into a full Bourne shell.)

Perhaps this is overkill?  Wouldn't it be possible to escape the most
troublesome characters in such a way that you could still use sed to
unpack it?  Anyone currently unpacking unix shar's has already
emulated sed to some degree; adding a few more substitute commands
couldn't be hard.  I don't advocate using AWK; while there is a VMS
version, it is large and not widely installed.  I suspect MSDOS or
Amiga sites would have similar problems.

Final note about the other proposed format: DON'T mung spaces into
tabs.  Some people don't use tabs.  The goal should be to do as good a
job as possible in reproducing the exact file that was packed.
--
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA
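Terry's sed-only idea does look workable for the escapes in Bengt's
format.  A sketch, assuming the body prefix is "X" plus one space and
ignoring folded "V" lines, which would need a smarter join (the `009
replacement below contains a literal tab character):

	# Decode tabs first, decode backticks (`096) last so decoded
	# text cannot form new escape sequences, then strip the "X "
	# prefix and print only archive body lines.
	sed -n -e 's/`009/	/g' \
	       -e 's/`096/`/g' \
	       -e 's/^X //p' part01 part02 > packer.txt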
chip@tct.uucp (Chip Salzenberg) (07/13/90)
According to overby@plains.UUCP (Glen Overby):
>Every time someone posts a source file which is not uuencoded, they
>get flamed by a dozen BITNauts who feel ripped off for not having
>gotten a good copy.

As far as I'm concerned, if they can't be troubled to translate ASCII
without lossage, munged sources are their own problem.
--
Chip, the new t.b answer man    <chip@tct.uucp>, <uunet!ateng!tct!chip>
peter@ficc.ferranti.com (Peter da Silva) (07/13/90)
In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
> Here's an idea: Lets compromise! Come up with a format that really works!

I've suggested this before... the software tools format.

> 1) The archive should be plain-text. That is, each text file in the archive
>    should be easy to locate within the archive, and it should be readable
>    without the need to extract it.

Headers and tailers are marked by "-h-" and "-t-". Other sequences could
be added, like "-d-" for directories.

> 2) The format would only be used to combine several text files into a single
>    text file. If you really must include a non-text file, then uuencode
>    that one file.

Exactly.

> 3) Archives should begin with a table of all printable ASCII characters,
>    so we can tell when transliteration has gone awry.

That's a nice enhancement.

> 4) The archive program should split long lines when the archive is created,
>    and rejoin them during extraction.

Not currently supported, but see below.

> 5) Tabs should be expanded to spaces. The extraction program should convert
>    groups of spaces back into tabs.

No. Tabs should be converted to a unique escape sequence.

> 6) The program that creates the archive should give a warning message when
>    a file's whitespace is likely to be reformated. For example, spaces at
>    the end of a line are a no-no.

No, spaces at the end of a line should be marked.

> 7) The extraction program should be clever enough to ignore news headers and
>    other introductory text, just for the sake of convenience.

Anything not between "-h-" and "-t-" can be safely ignored.

> 8) It should be possible to embed one archive inside another. This ability
>    probably wouldn't see much use, but lack of the ability could sure be a
>    nasty surprise to somebody. "What? You mean it only works on *some*
>    text files?"

Leading dashes are escaped with another dash.

> 9) Should we use trigraphs for some of the more troublesome ASCII characters?
>    The extraction utility could convert them back into real characters.

Yes, but not trigraphs. A two-character sequence should be enough... how
about "@x" for some value of x? @t would be tab, @! would be |, and so on.
Of course "@@" would be "@".

Begin *all* lines between -h- and -t- with X, or C if it's a continuation
of the previous line. Trailing spaces would have a "@" appended. (Of
course, some other escape character could be used... Kernighan and Pike use
"@" for other software tools utilities, is all.)

Or how about this: begin each line with T for text, C for continued text,
and M for uuencoded lines?
--
Peter da Silva.   `-_-'   +1 713 274 5180.   <peter@ficc.ferranti.com>
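A guess at what a tiny archive fragment might look like under the
conventions just described -- the file name, its placement on the -h- line,
and the contents are all invented for illustration:

   -h- hello.txt
   Xjust a plain line of text
   X@tthis line began with a tab
   Xthis line had trailing spaces   @
   Xthis line was folded when it was packed bec
   Cause it ran too long
   -t- hello.txt

Everything outside the -h-/-t- pair (news headers, signatures, moderator
comments) would simply be skipped by the extractor.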
darcy@druid.uucp (D'Arcy J.M. Cain) (07/13/90)
In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Here's an idea: Lets compromise! Come up with a format that really works!

Good idea, but I think it involves fixing things in different places rather
than trying to come up with a new format. No one is going to switch to a
completely new format.

>1) The archive should be plain-text. That is, each text file in the archive
>   should be easy to locate within the archive, and it should be readable
>   without the need to extract it.

Already covered by shar.

>2) The format would only be used to combine several text files into a single
>   text file. If you really must include a non-text file, then uuencode
>   that one file.

Shar again.

>3) Archives should begin with a table of all printable ASCII characters,
>   so we can tell when transliteration has gone awry.

Shar can be modified to put something in comments.

>4) The archive program should split long lines when the archive is created,
>   and rejoin them during extraction.

Modify shar to do this. All it would take is to put a '\' if the line goes
beyond a certain point and continue on the next line. The shell will put it
back together properly.

>5) Tabs should be expanded to spaces. The extraction program should convert
>   groups of spaces back into tabs.

Bad idea. What if the original file had no tabs in it? The extraction
program would change multiple spaces to tabs, thus doing exactly what you
are trying to avoid.

>6) The program that creates the archive should give a warning message when
>   a file's whitespace is likely to be reformated. For example, spaces at
>   the end of a line are a no-no.

Could be added to shar if desired.

>7) The extraction program should be clever enough to ignore news headers and
>   other introductory text, just for the sake of convenience.

Nice addition, but it is also nice to be able to use the shell, which
everyone has. Non-Unix boxes that emulate the work of the shell with an
extraction tool can add this if desired, and a preprocessor can be added on
Unix. I don't see this as being a major 'wannit' though.

>8) It should be possible to embed one archive inside another. This ability
>   probably wouldn't see much use, but lack of the ability could sure be a
>   nasty surprise to somebody. "What? You mean it only works on *some*
>   text files?"

I have never tried it, but I don't believe this is a problem with shar.

>9) Should we use trigraphs for some of the more troublesome ASCII characters?
>   The extraction utility could convert them back into real characters.

Again, shar can be banged on a little to handle this. Simply change the sed
command that it generates to translate trigraphs to the proper character.
Then shar can convert all troublesome characters and it will be converted
back when the script is run. Create a temporary sed script at the start of
the shar script and use it for all the files (a sketch of the idea follows
this article). That way, sites that have problems with some characters can
modify the script before running it so that those characters stay as
trigraphs. This has the added benefit of doing automatic trigraph
conversion at sites that require it. Is there a trigraph for tabs? If not,
just invent one and you can handle the tab problem as well.

>Did I miss anything? Did I get anything wrong? Does anybody know of an
>existing format that comes close to these specs?

shar. :-)

If people think there are some good ideas here, I am willing to do some
hacking on shar/unshar to implement some of this stuff.
--
D'Arcy J.M. Cain (darcy@druid)     |   Government:
D'Arcy Cain Consulting             |   Organized crime with an attitude
West Hill, Ontario, Canada         |
(416) 281-6094                     |
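A minimal sketch of that temporary-sed-script idea, using the standard C
trigraph spellings; the file names are illustrative, and the real shar
would generate all of this rather than have it written by hand:

   cat > /tmp/untri.sed << 'TRI_EOF'
   s/??</{/g
   s/??>/}/g
   s/??(/[/g
   s/??)/]/g
   s/??!/|/g
   TRI_EOF
   # each file in the archive is then unpacked through the same script:
   sed -f /tmp/untri.sed << 'SHAR_EOF' > prog.c
   ...text with { } [ ] | spelled as trigraphs...
   SHAR_EOF

A site that cannot represent, say, `{' at all would edit /tmp/untri.sed
once, before running the archive, instead of patching every file.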
jejones@mcrware.UUCP (James Jones) (07/13/90)
In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Here's an idea: Lets compromise! Come up with a format that really works!
>
>We should be able to come up with a protocol that combines the safety of
>uuencoding with the readability of shar archives. Some features I would
>like to see in the ultimate USENET archive format are:

[list of constraints edited out]

Actually, Eric Tilenius, former moderator of the BITNET CoCo mailing list,
came up with a fairly reasonable way of encoding files so that they could
survive the ASCII-EBCDIC meatgrinder. It doesn't bother letters, digits,
and spaces, so that (1) shar files fed through it are still pretty readable
and (2) compression on the output would still pay off. This format, which
goes by the name CUTS (dunno what that's an acronym for), might be a usable
one. If there's interest (express it via email, please!), I will go digging
for the description.

        James Jones
bengtl@maths.lth.se (Bengt Larsson) (07/14/90)
In article <3119.26a188ca@mccall.com> tp@mccall.com writes:
>I agree that the VMS_SHARE format is quite good. However, the VMS_SHARE
>format is self unpacking on VMS, just like a unix shar is on unix, with no
>tools that are not part of the OS. The VMS_SHARE is a DCL command procedure
>that contains a TPU program (TPU is the VMS programmable text editor) to
>unpack the files.

Yes, that's the big advantage (having TPU around as a standard, although
there were some problems with the "standard" from VMS 4.x to VMS 5.x!).

I must confess that I'm not much of a C programmer, so I really can't say
if the "unpacker" could be written portably. It just seemed to me that,
apart from sed, awk, sh etc., the most portable thing on Unix would be a
small C program. But, as I said, I'm no expert on portable C.

>The problem with a C program, is that it is VERY hard to write a portable
>program with no #ifdef's that will do the job. If you go this route, write
>it strictly as a filter, and invoke it just like you do sed in current
>shar's, with the <<'EOF' input specifier and the > redirection for output.
>And for heaven's sake, ONLY WRITE ONE OF THEM, and use only one name for
>it! Whatever you write, I have to recognize explicitly (I maintain a VMS
>unshar program, and I'm not interested in making it into a full Bourne
>shell.)

Hmmm, I can imagine the problems with writing an "unshar" for VMS (I'm more
familiar with VMS than Unix). As you say, the most important thing is that
whatever it is, it is a _standard_. A rigidly standardized version of
"shar" would do as well. And maybe it would be useful to use the <<'EOF'
method for feeding parts to the unpacker. Not so pretty, though :-)

>Perhaps this is overkill? Wouldn't it be possible to escape the most
>troublesome characters in such a way that you could still use sed to unpack
>it?

Maybe it is overkill. But what about folded long lines? Can they be
unpacked with sed? (See the sketch after this article for one way it might
be done.) Substituting the most important characters (for example tab)
would be doable in sed, I think.

>Final note about the other proposed format, DON'T mung spaces into tabs.

Agreed. Tabs should be preserved.

Anyway, I hope my proposed format gave some food for thought, especially
for Unix people. It would be much more portable to different systems than
the current versions of "shar". VMS_SHARE is certainly something to be
inspired by.

Summary of features in VMS_SHARE not present in "shar"s (at least not all
of them):
1. Escaping of all characters which are a) not printable ASCII, b) likely
   to be munged by Bitnet.
2. Folding of long lines.
3. Automatic skipping of News headers and such.
4. Checksums as standard (this uses the verb CHECKSUM, which comes with
   VMS).
5. Archived files are routinely split between archive parts, to keep each
   part a standard size. This of course also handles files bigger than any
   of the archive parts.

Features of my proposed format (advantages relative to "shar"):
1. Much more portable to different architectures (not just Unix).
2. It's easy to find the file names, since they are on lines starting with
   "file".
3. A standard checksum built in (maybe not a CRC, but something more
   powerful than a character count).
4. Much more protection against character munging, through character
   escapes.
5. Automatic skipping of News headers and such when unpacking.
6. Handles splitting of files between archive parts routinely. Handles
   archiving of files bigger than any archive part.

Disadvantages relative to "shar":
1. Slightly less readable, especially if many characters are escaped.
2. You must have an "unpacker" compiled (it may be distributed with the
   archive).

I'm sorry that I'm not much of a C programmer: I will not be implementing
this myself.

Bengt Larsson.
--
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET: TYCHE::BENGT_L
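As a partial answer to the sed question above: rejoining folded lines is
within sed's reach. A sketch, assuming lines are folded with a trailing `@'
marker and that a literal `@' is escaped elsewhere in the format (both
assumptions are illustrative, not part of any existing shar):

   sed -e :a -e '/@$/N' -e 's/@\n//' -e ta < folded > joined

The loop keeps appending the next input line and deleting the fold marker
until the current line no longer ends in `@'.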
wht@n4hgf.Mt-Park.GA.US (Warren Tucker) (07/15/90)
In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
> if people think there are some good ideas here I am willing to do some
> hacking on shar/unshar to implement some of this stuff.

Trust me :-), your mailbox will overflow with hate mail. Fortunately the
scores of "letter bombs" I got when I hacked on shar a while back were not
real, or my house would be gone and my yard would be a large crater.

However, change is mandated once per century or so. So, at risk of losing
my street number to a black hole, I'm posting shar 3.31.

flames > /dev/h2o
--------------------------------------------------------------------
Warren Tucker, TuckerWare    emory!n4hgf!wht or wht@n4hgf.Mt-Park.GA.US
Sforzando (It., sfohr-tsahn'-doh). A direction to perform the tone or
chord with special stress, or marked and sudden emphasis.
plocher@sally.Sun.COM (John Plocher) (07/16/90)
darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>>9) Should we use trigraphs for some of the more troublesome ASCII characters?
>>   The extraction utility could convert them back into real characters.
>Again shar can be banged on a little to handle this. Simply change the
>sed command that it generates to translate trigraphs to the proper character.
>Then shar can convert all troublesome characters and it will be converted back
>when the script is run.

Ok, so I generate a new-shar archive that looks like this:

   sed "s/???/}/" << EOF
   blah
   EOF

And I send it thru a machine that munges the '}' character. The orig file
won't extract correctly now, even with this "smart" sed script tacked on,
because now the sed script itself is broken.

In fact, sed-based trigraph translators can corrupt "correct" text like
this:

   Remind: Should this get fixed ???
   ...and...
   /* How many cards are there ??*/
   ...etc...

   -John
tp@mccall.com (07/16/90)
In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
> In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>>4) The archive program should split long lines when the archive is created,
>>   and rejoin them during extraction.
> Modify shar to do this. All it would take is to put a '\' if the line goes
> beyond a certain point and continue on the next line. The shell will put it
> back together properly.

Technical question from one who maintains a VMS unshar utility. Is this
true? What happens if you have a line of text in the actual file that ends
with a '\', which is common, for instance, in long #defines in C source?
Will current shar/unshar programs fold this into a single line? If not,
what is the difference from what you said above?

Do any shar utilities out there do this? Anyone want to guess whether
supporting it will cause more problems than it solves (assuming the
conflict in the previous paragraph)?
--
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA
drd@siia.mv.com (David Dick) (07/16/90)
In <373@anacom1.UUCP> jim@anacom1.UUCP (Jim Bacon) writes:
>I would strongly urge that only a single compression utility be used as
>a standard, and for UN*X I would suggest compress.

It has been reported that Unisys owns a patent on Lempel-Ziv or LZW
compression (I don't remember which) and believes every current use of
compress is in violation! I think they are taking steps to deal with this.

So, if you think "compress" will avoid the legal morass that arose in the
ARC world, you may be in for a surprise.

David Dick
Software Innovations, Inc. [the Software Moving Company (sm)]
leilabd@syma.sussex.ac.uk (Leila Burrell-Davis) (07/16/90)
In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Here's an idea: Lets compromise! Come up with a format that really works!
>
>We should be able to come up with a protocol that combines the safety of
>uuencoding with the readability of shar archives.

No one has yet mentioned Brad Templeton's abe, which was designed to solve
these problems, but has never achieved widespread usage. I enclose the
read.me from the package:
-----------------------------------------------------------------------
ABE Ascii-Binary Encoding System by B. Templeton

ABE is a replacement for uuencode/uudecode designed to deal with all the
typical problems of USENET transmission, along with those of other media.

Advantages are:
  Files are often smaller, and compress well.
  All printable characters map to themselves, so strings in binaries are
    readable right in the encoding.
  All lines are indexed, so sort(1) can repair any random scrambling of
    lines or files. (This can be turned off.)
  Extraneous lines (news headers, comments, signatures etc.) are ignored,
    even in the middle of encodings.
  A PD tiny decoder is available to include with files for first time users.
  Files can be split up automatically into equal sized blocks.
  Blocks can contain redundant information so that the decoder can handle
    blocks in any order, even with reposted duplicates and extraneous
    articles.
  Files with blank regions can be constructed from multi-part encodings
    with damaged blocks.
  Multiple files can be placed in one encoding.
  The decoder is extremely general and configurable, and supports many
    features not currently found in the encoder, but which other encoder
    writers might find useful.

In general, a redundant ABE encoding posted to a typical newsgroup over a
certain article region can be decoded with something as simple as:

  dabe /usr/spool/news/comp/binaries/group/3[45]?

where it doesn't matter much if there are postings in a random order,
duplicate postings, or inserted articles on other topics. I.e., exactly all
the things that are a pain about usenet (or mail) binaries. (You can
usually run dabe right on your entire mailbox.)

The ABE encoder (and decoder) support 3 different encoding formats. One
uses all 94 printable ASCII characters, another avoids characters that have
trouble in ASCII-EBCDIC translations, and the third is the UUENCODE format.
(ABE can make files decodable by a typical uudecode program.)
-----------------
--
Leila Burrell-Davis, Computing Service, University of Sussex, Brighton, UK
Tel: +44 273 678390                    Fax: +44 273 678470
Email: leilabd@syma.sussex.ac.uk  (JANET: leilabd@uk.ac.sussex.syma)
karl@haddock.ima.isc.com (Karl Heuer) (07/17/90)
In article <3131.26a188ca@mccall.com> tp@mccall.com writes:
>In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>> Modify shar to [split long lines]. All it would take is to put a '\' if
>> the line goes beyond a certain point and continue on the next line. The
>> shell will put it back together properly.
>
>Technical question ... Is this true?

No. Since shar'd files are quoted (via the hack of quoting the word that
identifies the terminator for the here-document), a backslash is not
special there. You'd have to use an unquoted here-document, and then escape
any real backslashes, dollar signs, or grave accents that appear in the
text.

Karl W. Z. Heuer (karl@kelp.ima.isc.com or ima!kelp!karl), The Walking Lint
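A small demonstration of the difference, runnable in any Bourne shell (the
file names are invented):

   # Quoted terminator: the body is taken literally, so the
   # backslash-newline survives intact -- this is what shar does today.
   cat << 'EOF' > literal.txt
   #define MAX(a,b) \
           ((a) > (b) ? (a) : (b))
   EOF

   # Unquoted terminator: the shell removes the backslash-newline (and
   # would also expand $ and ` ), so the #define comes out as one line.
   cat << EOF > joined.txt
   #define MAX(a,b) \
           ((a) > (b) ? (a) : (b))
   EOF

So a shar that wanted '\'-folding would first have to escape every literal
backslash, dollar sign, and grave accent in the text, which is exactly the
bookkeeping described above.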
mykel@saleven.oz (Michael Landers) (07/17/90)
In article <1990Jul13.022224.25441@lth.se> bengtl@maths.lth.se (Bengt Larsson) writes:
>Hmmm, one way to do it would be to write a little "unpacker" program (in C),
>and distribute it with the archive (in plain text).

It doesn't come up to the illustrious specs mentioned in the above article,
but if you have read the Obfuscated (sp?) C Contest winners, there is a
neat little unpacker written in something like 1000 characters that is
equivalent to "atob | zcat". This way we can use btoa'd compress'd tars and
no one will even notice...

Mykel.
--
 ()  \\ Black Wind always follows,   |\/|ykel Landers (mykel@saleven.oz)
  \\  \\ Where by dark horse rides,  _||_
   \\  \\ Fire is in my soul,        Phone: +612 906 3833
    \\  \\ Steel is by my side.      Fax: +612 906 2537
woods@robohack.UUCP (Greg A. Woods) (07/17/90)
In article <138944@sun.Eng.Sun.COM> plocher@sally.Sun.COM (John Plocher) writes:
> darcy@druid.uucp (D'Arcy J.M. Cain) writes:
> >Again shar can be banged on a little to handle this. Simply change the
> >sed command that it generates to translate trigraphs to the proper
> >character. Then shar can convert all troublesome characters and it will
> >be converted back when the script is run.
>
> Ok, so I generate a new-shar archive that looks like this:
> [blahhh....]
> And I send it thru a machine that munges the '}' character.
> The orig file won't extract correctly now, even with this
> "smart" sed script tacked on, because now the sed script itself
> is broken.

But at least you only have to fix one [set of] line(s), which you know all
about, instead of a million possible unknown points in a source file which
got munged. Too bad about the "extra" trigraph in the source. Perhaps shar
could check for these, and certainly the sender could pack and unpack
before sending to validate the correctness.
--
Greg A. Woods    woods@{robohack,gate,eci386,tmsoft,ontmoh}.UUCP
+1 416 443-1734 [h]  +1 416 595-5425 [w]  VE3-TCP  Toronto, Ontario; CANADA
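A sketch of that pack-and-unpack check, assuming a hypothetical `newshar'
that does the trigraph escaping (all of the names here are illustrative):

   newshar *.c > /tmp/part1.shar            # pack the sources
   mkdir /tmp/check
   ( cd /tmp/check && sh /tmp/part1.shar )  # unpack into a scratch dir
   for f in *.c
   do
           cmp "$f" /tmp/check/"$f" || echo "$f did not survive the round trip"
   done

If cmp reports no differences, the archive at least round-trips on the
sender's own machine before it ever hits the net.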
darcy@druid.uucp (D'Arcy J.M. Cain) (07/17/90)
In article <3131.26a188ca@mccall.com> tp@mccall.com writes:
>In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>> In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>>>4) The archive program should split long lines when the archive is created,
>>>   and rejoin them during extraction.
>> Modify shar to do this. All it would take is to put a '\' if the line goes
>> beyond a certain point and continue on the next line. The shell will put it
>> back together properly.
>
>Technical question from one who maintains a VMS unshar util. Is this true?
>What happens if you have a line of text in the actual file that ends with a
>'\', which is common, for instance, in long #defines in C source? Will
>current shar/unshar programs fold this into a single line? If not, what is
>the difference with what you said above?

You're right. I usually try things out before posting statements like that,
but I missed that one. The text between "<< SHAR_EOF" and the end of the
input text is of course not interpreted by the shell the way the commands
are. The shar program would need a little more work than I suggested above.

I am going to post my genfiles utility. This is a pair of programs that I
use for generating ASCII files on multiple systems. I will modify it to
include support for breaking long lines before posting it. Perhaps the net
would like to look at this as a starting point for some kind of standard
distribution method. I only have access to Unix and MSDOS, so I can't say
if it is suitable for other systems. It does allow stuff to be transmitted
in clear readable text.

Since I expect I will get many suggestions for improvements, should I post
to both alt.sources and comp.sources, or should I hold the c.s posting till
I have a more finished product?
--
D'Arcy J.M. Cain (darcy@druid)     |   Government:
D'Arcy Cain Consulting             |   Organized crime with an attitude
West Hill, Ontario, Canada         |
(416) 281-6094                     |
thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (07/18/90)
drd@siia.mv.com (David Dick) writes:
>It has been reported that Unisys owns a patent on Lempel-Ziv or
>LZW compression (I don't remember which) and believes every current
>use of compress is in violation!

I think I have seen the paper describing that work. It described the use of
some sort of Lempel-Ziv encoding between a mainframe and a disk subsystem
--- the processor being so fast that it could compress one block while the
previous was being transferred, thus increasing throughput.

A patent based on that would probably be a process patent covering the use
of a specific algorithm (12-bit compress) in a particular part of the I/O
path in a disk system. If Unisys think it covers the use of a PD compress
program on a general-purpose computer system, either they really got a
patent for the algorithm (which they oughtn't be able to) or they have
misunderstood something. On the other hand, I think the W in LZW may be the
author of that paper, which might give them a claim on that version of the
algorithm.

>I think they are taking steps to deal with this.

I hope those steps don't work.
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark   [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.   thorinn@diku.dk
mills@ccu.umanitoba.ca (Gary Mills) (07/18/90)
Expanding tabs in a shar is no good because it messes up context diffs.
Encoding them is a better idea. I use the following shar from time to time
because my Bitnet gateway (via UREP) expands tabs:
----------------8<---------------------8<-------------------8<--------
#!/bin/sh
# tshar: shell archiver that makes tabs visible
#
echo '#! /bin/sh'
echo '# This is a shell archive, meaning:'
echo '# 1. Remove everything above the #! /bin/sh line.'
echo '# 2. Save the resulting text in a file.'
echo '# 3. Execute the file with /bin/sh (not csh) to create:'
for file in "$@"
do
    echo "#    $file"
done
echo "# This archive created: `date`"
logname=${LOGNAME:-"unknown"}
fullname=`grep "^$logname" /etc/passwd | cut -d: -f5`
echo "# By: $logname ($fullname)"
echo 'export PATH; PATH=/bin:/usr/bin:$PATH'
echo "tab=\`echo \" \" | awk '{printf \"\\\t\\\n\"}'\`"
tab=`echo " " | awk '{printf "\t\n"}'`
for file in "$@"
do
    size=`wc -c < $file`
    echo "echo shar: \"extracting '$file'\" '($size characters)'"
    echo "if test -f '$file'"
    echo "then"
    echo "    echo shar: \"will not over-write existing file '$file'\""
    echo "else"
    echo "    sed -e 's/^X//' -e \"s/\^I/\$tab/g\" << \\SHAR_EOF > '$file'"
    sed -e 's/^/X/' -e "s/$tab/\^I/g" $file
    echo 'SHAR_EOF'
    echo "if test $size -ne \"\`wc -c < '$file'\`\""
    echo "then"
    echo "    echo shar: \"error transmitting '$file'\" '(should have been $size characters)'"
    echo "fi"
    echo "fi"
done
echo "exit 0"
echo '# End of shell archive'
#
--
-Gary Mills-         -University of Manitoba-         -Winnipeg-
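Usage is the same as a plain shar; the archive goes to standard output (the
file names below are just examples):

   tshar README Makefile prog.c > package.shar
   sh package.shar     # run by the recipient to recreate the files

On the wire, every line of file text is prefixed with X and every tab is
spelled ^I, so a tab-expanding gateway has nothing left to expand; the
generated sed commands put both back on extraction.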