[alt.sources.d] Unnecessary tar-compress-uuencodes

tneff@bfmny0.BFM.COM (Tom Neff) (07/09/90)

We have recently seen a spate of "source" postings in "uuencoded
compressed TAR" form, instead of SHAR or other traditional plain text
formats.  Now, possibly in response, we are seeing tools to manipulate
this format posted.  This is a bad trend!  Let's not encourage it
further.

The supposed advantage of shipping files this way is that when all the
decoding is finally done on the receiver's machine, you are guaranteed
the exact byte stream that existed on the source machine -- apparently a
very seductive feature for some authors.  But the price for this is
heavy:

 * Readers can no longer easily inspect the source postings BEFORE 
   installation, to see if they merit further interest.  Often they
   must spend the time and disk space to unpack everything before
   deciding whether to keep or delete it.  Nor are the usual article
   scanning tools such as rn's '/' and 'g' commands useful.

 * Compressed newsfeeds, which already impart whatever transmission
   efficiency gain LZW can offer, are circumvented and in fact
   sandbagged by the pre-compression of data.

 * Crucial source format conversions such as CR/LF replacement, fixed
   or variable record encoding, ASCII/EBCDIC translation, etc, which
   automatically take place in plain text news/notes postings, are
   again circumvented; users in alien environments are left with
   raw UNIX format bitstreams to deal with.

 * The format presupposes the existence of decoding tools which may
   or may not be present in a given environment.  Non-UNIX users who
   lack some of the automated extraction facilities we take for
   granted -- but who can still hand separate a few simple SHAR's into
   something useful -- are left out in the cold.

These objections are not just quibbles -- they cut to the heart of the
question of what a worldwide source text network is supposed to be
about.  News is not mail; news is not a BBS.  The "advantages" of
condensing source postings into gibberish are not worth the drawbacks.

NOTE: When it is occasionally necessary to distribute small, effectively
binary files (i.e., the precise bitstream is important) together with
larger "vanilla" source postings, as with a LaserJet printer manager,
then JUST those special files should be encoded (not compressed) with a
simple translator like 'btoa' or uuencode, and the resulting text
included in the otherwise plaintext archive.
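
A minimal sketch of that approach (the file names here are invented for
illustration):

	# encode ONLY the genuinely binary piece
	uuencode fontdata.bin fontdata.bin > fontdata.uu
	# then wrap it up with the readable sources in an ordinary shar
	shar README Makefile manager.c fontdata.uu > manager.shar

Readers can still browse everything except the one encoded file.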
-- 
Psychoanalysis is the mental illness   \\\    Tom Neff
it purports to cure. -- Karl Kraus      \\\   tneff@bfmn0.BFM.COM

roy@cs.umn.edu (Roy M. Silvernail) (07/09/90)

tneff@bfmny0.BFM.COM (Tom Neff) writes:

> We have recently seen a spate of "source" postings in "uuencoded
> compressed TAR" form, instead of SHAR or other traditional plain text
> formats.  Now, possibly in response, we are seeing tools to manipulate
> this format posted.  This is a bad trend!  Let's not encourage it
> further.

I agree completely! I'm DOS-bound, but I was thinking of having a go at
porting the Anonymous Contact Service... unfortunately, the tarfile is
replete with unix filenames that will choke a DOS machine. I could mung
them manually in a shar.

>  * The format presupposes the existence of decoding tools which may
>    or may not be present in a given environment.  Non-UNIX users who
>    lack some of the automated extraction facilities we take for
>    granted -- but who can still hand separate a few simple SHAR's into
>    something useful -- are left out in the cold.

Tools I have... compress, PAX, uu*code... I also have a brain-dead OS to
deal with.

Thanks, Tom, for pointing out the problems with compressed tarfiles.

--
    Roy M. Silvernail   |   "It won't work... I have an  | Opinions found
    now available at:   |   exceptionally large mind."   | herein are mine,
 cybrspc!roy@cs.umn.edu | --Marvin, the paranoid android | but you can rent
(cyberspace... be here!)|                                | them.

kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (07/10/90)

Some text has been edited out of the following quotes...

In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.

I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm not
entirely happy with it, but I believe there are some valid reasons for
using this ugly format...

> * Readers can no longer easily inspect the source postings BEFORE 
>   installation, to see if they merit further interest.

A valid gripe.  I agree with you 100% on this one; it is the main reason
I'm not entirely pleased with uuencoding.

I always describe the contents of a uuencoded article in a plain-text
paragraph before the "begin" line of the uuencoded stuff.  I try to
make this description sufficient to allow readers to decide whether or
not the article is worth keeping.

> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

So sites with compressed newsfeeds don't care a whole lot, but those with
uncompressed feeds DO care.  Any sites with little free disk space also benefit
from the compression.

> * Crucial source format conversions such as CR/LF replacement, fixed
>   or variable record encoding, ASCII/EBCDIC translation, etc, which
>   automatically take place in plain text news/notes postings, are
>   again circumvented; users in alien environments are left with
>   raw UNIX format bitstreams to deal with.

But I don't want the network to translate my articles!  When I post an article,
there's a good chance that it will go from a UNIX machine, through BITNET, to
another UNIX machine.  Because it went through BITNET, it will have been
translated from ASCII into EBCDIC and back into ASCII.  This translation may
leave scars: some characters may have been transliterated incorrectly, long
lines may be silently truncated or split, and whitespace may be changed.  And
all of this is happening on machines that I have no control over!

When I transmit a file, I want it to be received unchanged.  If it must be
translated to suit the receiver's environment, then that translation should
be done explicitly by the receiver, not magically by some machine halfway
between here & there.

> * The format presupposes the existence of decoding tools which may
>   or may not be present in a given environment.

They should be.  People have been posting them, and they're available at
archive sites.

Certainly, when I post an article, I do so because I want to make my source
code available to people.  Anything that limits the availability should be
viewed with a critical eye.  Uudecode and compress fall into that category.
So does the BITNET protocol.  A user who lacks uuencode and compress can get
them from somewhere.  A user who has only a BITNET feed is stuck.

If there were no such thing as BITNET, then I would probably use shar.

>Psychoanalysis is the mental illness   \\\    Tom Neff
>it purports to cure. -- Karl Kraus      \\\   tneff@bfmn0.BFM.COM

-------------------------------------------------------------------------------
Steve Kirkendall    kirkenda@cs.pdx.edu    uunet!tektronix!psueea!eecs!kirkenda

chip@tct.uucp (Chip Salzenberg) (07/10/90)

According to kirkenda@eecs.UUCP (Steve Kirkendall):
>So sites with compressed newsfeeds don't care a whole lot, but those with
>uncompressed feeds DO care.

If anyone is insane enough to run an uncompressed newsfeed, then he
deserves what he gets.

>Any sites with little free disk space also benefit from the
>compression.

Sites with little disk space shouldn't be receiving the sources
groups.
-- 
Chip Salzenberg at ComDev/TCT     <chip@tct.uucp>, <uunet!ateng!tct!chip>

doug@letni.UUCP (Doug Davis) (07/10/90)

In article <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm not
>entirely happy with it, but I believe there are some valid reasons for
>using this ugly format...


>I always describe the contents of a uuencoded article, in a plain-text
>paragraph before the "begin" line of the uuencoded stuff.  I try to
>make this description sufficient to allow readers to decide whether or
>not the article is worth keeping.
>
>> * Compressed newsfeeds, which already impart whatever transmission
>>   efficiency gain LZW can offer, are circumvented and in fact
>>   sandbagged by the pre-compression of data.
>
>So sites with compressed newsfeeds don't care a whole lot, but those with
>uncompressed feeds DO care.  Any sites with little free disk space also benefit
>from the compression.

Actually this is a very incorrect assumption; very few newsfeeds any
more are not compressed in some way.  Compressing/uuencoding/etc.
a posting neatly circumvents any compression.  The minimal savings
on disk space doesn't justify doubling the phone time it costs
the article to get to the site.    Disk space is cheap, memory
is cheap, in-line compression is cheap.  However *PHONE TIME* is 
expensive.   A lot of Usenet is in the dialup world, and extra
phone costs that are needlessly added on are not appreciated.

>But I don't want the network to translate my articles!
Yes you do, unless you're posting binaries (which is another pain).

>When I post an article, there's a good chance that it will go from a
>UNIX machine, through BITNET, to another UNIX machine.  Because it
>went through BITNET, it will have been translated from ASCII into
>EBCDIC and back into ASCII.  This translation may leave scars:
Sites that have this problem, and they are getting rare, are already
dealing with this issue.  Dealing with it by costing the rest of
us more money is not a viable alternative.

Your code needs to be changed in the BITNET world before it can be
used; people know that, and software is written to do this FOR users
at those sites, automagically, so they don't have to go dredging for
utilities for such things.   You have to expect that the site admins
might know what they are doing and are not blindly allowing software
to hack up your postings.

People know how to handle shars; it's a nice standard for posting
sources.  If you have a binary, or an object that needs to be
posted as well, then by all means compress and uuencode it.  But
SHAR that with your sources and post your package that way.  It
makes more sense and is much more appreciated.

doug
__
Doug Davis/4409 Sarazen/Mesquite Texas, 75150/214-270-9226
{texsun|lawnet|texbell}!letni!doug or doug@letni.lonestar.org

                                                              "Be seeing you..."

silvert@cs.dal.ca (Bill Silvert) (07/10/90)

In article <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Some text has been edited out of the following quotes...
ditto

>> * The format presupposes the existence of decoding tools which may
>>   or may not be present in a given environment.
>
>They should be.  People have been posting them, and they're available at
>archive sites.

Not all common tools are "available", in the sense that they can be
recovered from archive sites and recompiled, on all machines.  For
example, I cannot get 16-bit uncompression on my MS-DOS machine, and the
uncompress I ported to my obsolete Unix box is pretty flaky.

I support sticking with shar as much as possible, since all you really
need is a text editor to pull shar archives apart.

The translation problem that Steve refers to could be solved by
including an ASCII table at the head of each shar, so that
transpositions of characters could be identified.  This sort of thing is
done by the Dumas version of uuencode.
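
A rough sketch of how such a table might be produced (this is only an
illustration, not how the Dumas uuencode does it; it assumes an awk whose
printf "%c" accepts a numeric character code):

	awk 'BEGIN { for (i = 33; i <= 126; i++) printf "%c", i; print "" }'

A recipient who compares that reference line against a locally generated
copy can see at a glance which characters were transliterated in transit.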


-- 
William Silvert, Habitat Ecology Division, Bedford Inst. of Oceanography
P. O. Box 1006, Dartmouth, Nova Scotia, CANADA B2Y 4A2.  Tel. (902)426-1577
UUCP=..!{uunet|watmath}!dalcs!biomel!bill
BITNET=bill%biomel%dalcs@dalac	InterNet=bill%biomel@cs.dal.ca

tp@mccall.com (07/10/90)

In article <sean.647630062@s.ms.uky.edu>, sean@ms.uky.edu (Sean Casey) writes:
> doug@letni.UUCP (Doug Davis) writes:
> 
> Compressing an article reduces phone time. If compress finds a file
> is bigger after compression, it doesn't compress it. So the phone
> costs really aren't increased by users doing their own compression.

If the user doing compression defeats the compression that would otherwise
be done, then it prevents compress from reducing the costs, which means the
same thing as increasing them. The question to be answered is whether a
compressed uuencoded compressed tar file is smaller or larger than a
compressed shar file, even given that compress will not increase the size
(the extra compress on these being the one done by news when it wants to
send it). I'll leave that little exercise to someone who can do it more
easily.

> Plus, Lempel-Ziv is not the ultimate compressor for certain kinds
> of data.  There's ways of compressing bitmaps, for instance, that
> are a lot more effective.

I believe most of the previous posters on this did say that compressing and
uuencoding a binary (bitmaps certainly qualify) was a valid thing to do.
Just compress and encode the files that need it, though, and shar it up
with the source code.

> I don't see what the problem is. Compress is smart enough to not
> expand files, and it *does* save disk space on the remote site, so
> why complain?

My biggest gripe is that I can't read the stuff to see if I want it. Tom
may post an explanation, but not all posted explanations are useful. The
anonymous contact software has a whole posting describing how to unpack it
and how to build it, with NO information as to what the h*ll it is. Also,
does anyone really think that posting it this way saved anyone any
disk space, since extra tools were posted to unpack it?

Even good explanations aren't good enough. The rest of you probably don't
care, but I run VMS, and I need to glance at the code to see if it is
useful to me. Some ports are trivial, some are very difficult. At least the
stuff on the ACS software mentioned that it is in perl, which doesn't run
on VMS (at the moment), so I was saved the inordinate hassle of unpacking
that one to find out what it was.

Many people will ignore your posting if they can't unpack it with the tools
immediately at hand, and do so easily. I wonder how many people saw all the
uumerge stuff and the encoded file and just skipped it as being not worth
the trouble to read a bunch of stuff just to figure out how to unpack it.
-- 
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA

sean@ms.uky.edu (Sean Casey) (07/11/90)

doug@letni.UUCP (Doug Davis) writes:

|Actually this is a very incorrect assumption, very few newsfeeds any
|more are not compressed in some way.  Compressing/uuencoding/etc 
|a posting neatly circumvents any compression.  The minimal savings
|on disk space doesn't justify doubling the phone time it costs
|the article to get to the site.    Disk space is cheap, Memory
|is cheap, in line compression is cheap. However *PHONE TIME* is 
|expensive.   A lot of usenet is in the dialup world, and extra
|phone costs that are needlessly added on, are not appreciated.

Compressing an article reduces phone time. If compress finds a file
is bigger after compression, it doesn't compress it. So the phone
costs really aren't increased by users doing their own compression.

Plus, Lempel-Ziv is not the ultimate compressor for certain kinds
of data.  There's ways of compressing bitmaps, for instance, that
are a lot more effective.

I don't see what the problem is. Compress is smart enough to not
expand files, and it *does* save disk space on the remote site, so
why complain?

Sean

tp@mccall.com (07/11/90)

In article <1990Jul10.182546.26487@diku.dk>, thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) writes:
> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>>[Many good reasons not to tar-compress-uuencode source and other
>>plain text in news postings.]
>>
>> * Compressed newsfeeds, which already impart whatever transmission
>>   efficiency gain LZW can offer, are circumvented and in fact
>>   sandbagged by the pre-compression of data.
> 
> 	name	       size		crummy ASCII graphics
> 	----------  -------		---------------------
> 	tar	    4718592	tar	 ------- -60.3% ------>	tar.Z
> 	tar.Z	    1874378	+37.8%				 +37.8%
> 	tar.Z.uu.Z  2229065	tar.uu.Z -------  -6.8% ------>	tar.Z.uu.Z
> 
> Of course, compression factors will vary widely; I have made this
> experiment several times, with the same picture emerging: It pays to
> compress before uuencoding, and it pays to compress after, and it pays
> best to do both.

It is true, as you show, that if you have to post uuencoded stuff, you
are better off compressing it first. That is not the issue being argued
however. The issue under discussion is whether it is better to post a shar
file or a uuencoded compressed tar file. In either case the data will be
compressed during the news feed. However, your figures above show that the
compressed tar file is smaller than the compressed uuencoded compressed tar
file. 

If we assume that a tar file is about the same size as a shar file, your
figures show that posting uuencoded compressed tar files uses MORE bandwidth
than posting a shar.

I KNOW that this is a whopper of an assumption, but I don't have a shar
program and my tar writer is a royal pain to use. If someone would take a
large amount of source code (like if you happen to have rn or nn or B news
lying around) and do the following:

1) shar it and compress it and add up the sizes of all the compressed
parts.

2) tar it, compress it, uuencode it, and compress it again and check the
size.

If the result of (1) is less than the result of (2), then in addition to
all the other reasons not to post uuencoded compressed tars, we will find
that it ALSO uses more net bandwidth, and thus has NO redeeming features
(assuming all ascii text data). If, on the other hand, the result of (2) is
less than the result of (1), we will find that there is a redeeming
feature, but I will still contend that it is a pain in the rear to receive
postings in this format.
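
For whoever does take this up, a sketch of steps (1) and (2) in command
form (assuming a source tree in ./src, a single shar part, and a shar that
writes to standard output; the trailing compress stands in for the news
link's own compression):

	shar src/* > pkg.shar
	compress < pkg.shar | wc -c                                        # (1)
	tar cf - src | compress | uuencode pkg.tar.Z | compress | wc -c    # (2)

Whichever count is smaller is the cheaper posting over a compressed feed.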
-- 
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA

thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (07/11/90)

tneff@bfmny0.BFM.COM (Tom Neff) writes:
>[Many good reasons not to tar-compress-uuencode source and other
>plain text in news postings.]
>
> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

That turns out not to be the case. It is true that a compressed file
will usually expand if it is compressed again. But the intervening
uuencode is very important: Compressing a uuencoded file is somewhat
independent of compressing the original (*). I made an experiment with
a tar of a directory tree with mixed source, binaries, and images.

	name	       size		crummy ASCII graphics
	----------  -------		---------------------
	tar	    4718592	tar	 ------- -60.3% ------>	tar.Z
				   |                                |   
	tar.Z	    1874378	+37.8%				 +37.8%
				   |				    |
	tar.uu	    6501192	   V				    V
				tar.uu	 ------- -60.3% ------>	tar.Z.uu
	tar.Z.uu    2582500	   |				    |
				-63.2%				 -13.7%   
	tar.uu.Z    2392701	   |				    |     
				   V				    V
	tar.Z.uu.Z  2229065	tar.uu.Z -------  -6.8% ------>	tar.Z.uu.Z

Of course, compression factors will vary widely; I have made this
experiment several times, with the same picture emerging: It pays to
compress before uuencoding, and it pays to compress after, and it pays
best to do both.

In words: If you have to post uuencoded stuff (tar archives, images,
whatever), COMPRESS them first. It is always better: In terms of
storage on intermediate nodes and of transmission on non-compressed
links it is very much better; it may not save much on compressed
links, but it doesn't hurt (contrary to common assertions), and the
small saving may still pay for the cost to run compress (and compress
has less data to process, anyway, so it doesn't run for so long).
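
In command form, a sketch (the file name is invented):

	compress image.tar                          # image.tar -> image.tar.Z
	uuencode image.tar.Z image.tar.Z > posting.uu

rather than running uuencode on image.tar directly.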

I wish this misconception about the badness of compressed uuencoded
data on compressed news links would go away; anyone for a news.config
FAQ posting?
______________________________________________________________________
(*) An attempt at an explanation: The uuencode process maps the source
bytes into a smaller set (64 symbols), and it maps three source bytes
into four and puts in newlines. Compress works by finding common byte
sequences and mapping them into symbols. A common source sequence will
occur in three different ``phases'' after uuencode, and may be broken
by newlines, so compress will not find it as easily. Of course, long
sequences of identical bytes, as often in images, are immune to the
shift effect.

On the other hand, a 16-bit compress should be able to map all the
2-symbol uuencode sequences and about one fourth of the 3-symbol ones
into a 16-bit symbol, giving a compression of about 12% on the
uuencode of a totally random byte sequence. (Running compress after
compress-uuencode usually gives between 11% and 14% compression,
bearing this out; for this purpose, the first compress effectively
gives a random sequence.)

So: compress may get more of the ``available compression'' in a given
input if it is run before uuencode. On the other hand, compress will
be able to undo some of the expansion caused by uuencode, masking the
first effect.

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk

amos@taux01.nsc.com (Amos Shapir) (07/11/90)

In article <3114@psueea.UUCP> you write:
|> * Crucial source format conversions such as CR/LF replacement, fixed
|>   or variable record encoding, ASCII/EBCDIC translation, etc, which
|>   automatically take place in plain text news/notes postings, are
|>   again circumvented; users in alien environments are left with
|>   raw UNIX format bitstreams to deal with.
|
|But I don't want the network to translate my articles!  When I post an article,
|there's a good chance that it will go from a UNIX machine, through BITNET, to
|another UNIX machine.  Because it went through BITNET, it will have been
|translated from ASCII into EBCDIC and back into ASCII.  This translation may
|leave scars: some characters may have been transliterated incorrectly, long
|lines may be silently truncated or split, and whitespace may be changed.  And
|all of this is happening on machines that I have no control over!
|

The point was, what happens to those who use the BITNET/EBCDIC machines?
The fact is, they are more numerous than UNIX users.

|Certainly, when I post an article, I do so because I want to make my source
|code available to people.  Anything that limits the availability should be
|viewed with a critical eye.  Uudecode and compress fall into that catagory.
|So does the BITNET protocol.  A user who lacks uuencode and compress can get
|them from somewhere.  A user who has only a BITNET feed is stuck.

Precisely for this reason, it is better to write your source in a way that
it won't be so sensitive to such changes.

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522408  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N

peter@ficc.ferranti.com (Peter da Silva) (07/11/90)

In article <sean.647630062@s.ms.uky.edu> sean@ms.uky.edu (Sean Casey) writes:
> doug@letni.UUCP (Doug Davis) writes:
> Compressing an article reduces phone time. If compress finds a file
> is bigger after compression, it doesn't compress it. So the phone
> costs really aren't increased by users doing their own compression.

Sure, because it's compressed *and then uuencoded*. Compressed uuencoded
files are (a) likely bigger than the original, and (b) don't compress
very well.

> I don't see what the problem is. Compress is smart enough to not
> expand files, and it *does* save disk space on the remote site, so
> why complain?

And it doesn't save disk space on the remote site either, because of the
uuencoding.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.
<peter@ficc.ferranti.com>

woods@eci386.uucp (Greg A. Woods) (07/11/90)

In article <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
> In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
> >We have recently seen a spate of "source" postings in "uuencoded
> >compressed TAR" form, instead of SHAR or other traditional plain text
> >formats.
> 
> I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm not
> entirely happy with it, but I believe there are some valid reasons for
> using this ugly format...

ARGH!  ALL YOUR REASONS WERE PREVIOUSLY INVALIDATED BY STEVE!  Did you
read what he wrote?  Did you understand it?  IMHO there are *NO* valid
reasons for using this ugly format, and you certainly didn't uncover
any in your article!

> So sites with compressed newsfeeds don't care a whole lot, but those with
> uncompressed feeds DO care.  Any sites with little free disk space also benefit
> from the compression.

Sites without compressed newsfeeds know what they are getting, and
have chosen to do it this way.  You cannot pre-suppose that you are
giving them a helping hand by compressing your postings beforehand.

Are you trying to tell me that you can justify wasting the bandwidth
caused by the failure of compress on large batches between the many
normal sites, for the sake of the very few who have chosen not to use
compress for some arcane reason?

Sites with little free disk space will *not* gain appreciably from
a few obscure people compressing and uuencoding their postings.  If
disk space is that tight, those sites will have other, much more
efficient ways of dealing with the problem.  In fact, these sites may
actually be impacted greatly by such postings, as every news reader
who finds interest in the article may take it upon himself to un-pack
his own private copy to see what it's all about!

> > * Crucial source format conversions such as CR/LF replacement, fixed
> >   or variable record encoding, ASCII/EBCDIC translation, etc, which
> >   automatically take place in plain text news/notes postings, are
> >   again circumvented; users in alien environments are left with
> >   raw UNIX format bitstreams to deal with.
> 
> But I don't want the network to translate my articles!
>[....]
> When I transmit a file, I want it to be received unchanged.  If it must be
> translated to suit the receiver's environment, then that translation should
> be done explicitly by the reciever, not magically by some machine halfway
> between here & there.

That is usually the case, and what Steve was referring to.  It is rare
to find sites mangling news which is only passing through these days.
The translation usually occurs either during the storing of news, or
in the retrieval by the newsreader.

Meanwhile, your ugly format has destroyed the automatic translation
capabilities of those sites which need it, and forced the end users to
individually convert your postings by hand (or with the help of one of
the tools Steve referred to, provided they can be made to work in the
environment in question).

If your code is so gross as to contain escape sequences, or 8-bit
data, then you deserve the "conversion"!  :-)  It will remain the case
for quite some time that any files containing anything but the 96
printable characters in the ASCII set will be subject to change, even
on UNIX to UNIX links, both with mail, and with news.  Since anything
but the 96 printable chars isn't printable by definition, how could it
be source in the first place?  If your files are riddled with such
"garbage", please feel free to uuencode them, but please post them to
an arbitrary binaries group.

> > * The format presupposes the existence of decoding tools which may
> >   or may not be present in a given environment.
> 
> They should be.  People have been posting them, and they're available at
> archive sites.

Just because they exist for UNIX environments, does not mean they are
available at all sites, nor that they are available for other
environments, nor that the end user can build and install them.

> A user who lacks uuencode and compress can get
> them from somewhere.

That's the attitude which has caused much frustration to new UNIX
users.  So many people have been turned off UNIX and usenet because
some non-thinking guru said it was easy to snarf something fancy from
some far remote site, and port it.  Meanwhile the new fellow still
hasn't got his modem working well!

> If there was no such thing as BITNET then I would probably use shar.

Shar neither helps nor hinders transmission through BITNET.

Finally, I'm curious just how many BITNET sites are in the top 1000
that Brian Reid posts.  What is the total percentage of usenet traffic
which flows through them?  Are any of them currently mangling news
they pass through?
-- 
						Greg A. Woods

woods@{eci386,gate,robohack,ontmoh,tmsoft}.UUCP
+1-416-443-1734 [h]  +1-416-595-5425 [w]    VE3-TCP	Toronto, Ontario CANADA

csu@alembic.acs.com (Dave Mack) (07/11/90)

As the culprit in one of the more recent crimes of this nature, I
suppose I should answer this.

In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.  Now, possibly in response, we are seeing tools to manipulate
>this format posted.  This is a bad trend!  Let's not encourage it
>further.
>
>The supposed advantage of shipping files this way is that when all the
>decoding is finally done on the receiver's machine, you are guaranteed
>the exact byte stream that existed on the source machine -- apparently a
>very seductive feature for some authors.  But the price for this is
>heavy:

The supposed advantage in the case of the Anonymous Contact Service
software which I recently posted to alt.sources is that the uuencoded
compressed tar file was 135K, whereas the corresponding shar file is
235K. Also, my version of shar3.24 died horribly when presented with
a directory tree (I now have Rich Salz' cshar kit, including makekit,
which solves almost all my problems, except that it insists on putting
the README in the second kit.)

> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

Drivel. See above. I sincerely doubt that recompressing a uuencoded
compressed file expands it significantly beyond the overhead already
added by uuencode. Sending shar files costs additional disk space, and
quite a few news links use 12-bit rather than 16-bit compression.


However, since the consensus on the net seems to be that the available
transmission bandwidth and disk storage space are both unlimited, my
next release of the ACS will be in the form of shar files. As an 
added bonus, all of the filenames will be under 14 characters in this
one.

I cannot, however, guarantee that the README will be in Part01.

Dave Mack
embittered idealist, net.scum, villain, and commercial abuser of the
net for over three days.

drd@siia.mv.com (David Dick) (07/11/90)

In <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:

>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.  Now, possibly in response, we are seeing tools to manipulate
>this format posted.  This is a bad trend!  Let's not encourage it
>further.

> But the price for this is heavy:

> [ list of significant reasons omitted ]

This is the biggie!

> * The format presupposes the existence of decoding tools which may
>   or may not be present in a given environment.  Non-UNIX users who
>   lack some of the automated extraction facilities we take for
>   granted -- but who can still hand separate a few simple SHAR's into
>   something useful -- are left out in the cold.

>These objections are not just quibbles -- they cut to the heart of the
>question of what a worldwide source text network is supposed to be
>about.  News is not mail; news is not a BBS.  The "advantages" of
>condensing source postings into gibberish are not worth the drawbacks.

As the net expands to encompass a larger and more diverse audience,
familiarity with arcane encoding methods becomes rarefied.
The whole point of the original "shar" was that it only assumed 
a shell and a few commands which everyone on the fledgling USENET
had.

Even among the cognoscenti, propagation of new tools and processing
methods is not guaranteed; the time pressures of a job can interfere!

I think making a significant change in distribution procedures for
the benefit of one adjunct to USENET (BITNET), to the disadvantage of much
of the rest of the net, is a bad idea.

David Dick
Software Innovations, Inc. [the Software Moving Company (sm)]

woods@robohack.UUCP (Greg A. Woods) (07/11/90)

In article <sean.647630062@s.ms.uky.edu> sean@ms.uky.edu (Sean Casey) writes:
> doug@letni.UUCP (Doug Davis) writes:
> 
> |Actually this is a very incorrect assumption, very few newsfeeds any
> |more are not compressed in some way.  Compressing/uuencoding/etc 
> |a posting neatly circumvents any compression.  The minimal savings
> 
> Compressing an article reduces phone time. If compress finds a file
> is bigger after compression, it doesn't compress it. So the phone
> costs really aren't increased by users doing their own compression.
> 
> Plus, Lempel-Ziv is not the ultimate compressor for certain kinds
> of data.  There's ways of compressing bitmaps, for instance, that
> are a lot more effective.
> 
> I don't see what the problem is. Compress is smart enough to not
> expand files, and it *does* save disk space on the remote site, so
> why complain?

Because, if you are batching with a more reasonable size like
200-400Kb, a couple of gundged up, uuencoded files may mess up the
compression of the remainder of the articles in the batch.  I've not
actually measured this carefully, but the occasional time I've made a
tar of a directory with a few such files, or worse yet a few already
compressed files, I get 20-50% larger archives than if I unpack those
few files first, then archive and compress the whole works.  Remember
LZ compression uses a dictionary....

As we've been saying over and over, shar's are simple and easily dealt
with in any number of environments.  They don't intrude on the casual
reader's ability to browse, and they don't make your head hurt as
mine certainly does when I see the even lines of absolute garbage that
uuencode creates.

It's not the cost of phone lines alone we're worrying about, but
also the manpower involved!

As for more efficient compression algorithms for special forms of
data, please do use them where appropriate, IFF the decoders are
readily available to the target audience.  HOWEVER, PLEASE post them
ONLY to an appropriate binaries group, unless they are part of a
*shar*ed source posting.

As for saving disk space, please don't bother!  The volume involved
amongst the zillions of un-compressed postings is insignificant.
Besides I *do* like to grep through files in the spool directories,
and I *do* like to be able to do regex searches in rn.

Just what is the average end savings in converting an arbitrary binary
to an approx. 35% compressed binary stream, then converting it back to
the 96 printable characters in record format?  The few times I've done
it, the savings have not been appreciable, but were necessary to mail
an otherwise impossible to mail binary.
-- 
						Greg A. Woods

woods@{robohack,gate,eci386,tmsoft,ontmoh}.UUCP
+1 416 443-1734 [h]   +1 416 595-5425 [w]   VE3-TCP   Toronto, Ontario; CANADA

dave@galaxia.Newport.RI.US (David H. Brierley) (07/11/90)

In article <sean.647630062@s.ms.uky.edu> sean@ms.uky.edu (Sean Casey) writes:
>Compressing an article reduces phone time. If compress finds a file
>is bigger after compression, it doesn't compress it. So the phone
>costs really aren't increased by users doing their own compression.

Almost, but not quite.  If you run the compress program by typing
"compress filename" and the resulting compressed file is bigger than
the original then compress will save the original and delete the 
compressed file.  On the other hand, if you type "compress <f1 >f2"
then the compress program will happily create an output file which
is larger than the input file.  Since news uses the compress program
in a pipeline, this is essentially what happens.  Here is an example:

	-rw-rw----  1 dave    family    73150 May 23 17:32 spool.sum
	compress <spool.sum >test1.Z
	-rw-rw----  1 dave    family    15791 Jul 11 09:40 test1.Z
	compress <test1.Z >test2.Z
	-rw-rw----  1 dave    family    23099 Jul 11 09:41 test2.Z

As you can see, test2.Z is 46% larger than test1.Z.  This was done using
full 16-bit compression.  If you use 12-bit compression, which a lot of sites
are using, the results are even worse.
-- 
David H. Brierley
Home: dave@galaxia.Newport.RI.US    {rayssd,xanth,att} !galaxia!dave
Work: dhb@quahog.ssd.ray.com        {uunet,sun,att,uiucdcs} !rayssd!dhb

mlord@bwdls58.bnr.ca (Mark Lord) (07/11/90)

In article <1990Jul10.160257.24183@cs.dal.ca> bill%biomel@cs.dal.ca writes:
>Not all common tools are "available", in the sense that they can be
>recovered from archive sites and recompiled, on all machines.  For
>example, I cannot get 16-bit uncompression on my MS-DOS machine, and the
>uncompress I ported to my obsolete Unix box is pretty flaky.

How strange... I have no fewer than three (3) independent 16-bit uncompress
programs on my MSDOS machine, all of which were easy to obtain from SIMTEL20.

Two of them even handle the older 12-bit style as well.  Also, I have two
versions of programs to handle tar files, and a multitude of UU/XX decoders,
again, all from SIMTEL20.

If you can send/receive email, you can get them from a LISTSERV near you.

doug@letni.UUCP (Doug Davis) (07/11/90)

In article <sean.647630062@s.ms.uky.edu> sean@ms.uky.edu (Sean Casey) writes:
>Compressing an article reduces phone time. If compress finds a file
>is bigger after compression, it doesn't compress it. So the phone
>costs really aren't increased by users doing their own compression.

This works for files, but the current news software doesn't take advantage
of compress like that.  If you pipe into compress it WILL ALWAYS OUTPUT
compressed-format data, either smaller or larger than the input, depending
on the effect of the algorithm.


doug
__
Doug Davis/4409 Sarazen/Mesquite Texas, 75150/214-270-9226
{texsun|lawnet|texbell}!letni!doug or doug@letni.lonestar.org

                                                              "Be seeing you..."

dave@galaxia.Newport.RI.US (David H. Brierley) (07/11/90)

In article <1990Jul10.182546.26487@diku.dk> thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) writes:
>	name	       size		crummy ASCII graphics
>	----------  -------		---------------------
>	tar	    4718592	tar	 ------- -60.3% ------>	tar.Z
>	tar.Z	    1874378	+37.8%				 +37.8%
>	tar.Z.uu.Z  2229065	tar.uu.Z -------  -6.8% ------>	tar.Z.uu.Z

Several points I would like to make.

1) The compressed-uuencoded-compressed file is almost 20% larger than the
compressed file, therefore you have *increased* my phone bills by 20%.  I
do not exactly appreciate this.

2) You have increased both the amount of disk space and the time required
for me to determine if this program is useful to me.  First I have to
uudecode it, then I have to uncompress it, and then I have to un-tar it
(the two paths are sketched below, after point 4).  Each of these steps
requires disk space and time.  With a shar posting I can read the entire
source before I even save it into my directory.  I can also unpack a
multi-part shar file one piece at a time and then remove the piece that
I just un-shar'ed, thus greatly reducing the disk space requirements.

3) "tar" format is a lot less portable than "shar" format.  With a shar file
I can edit the file names if they are too long for my system V based machine.
Try doing that with a tar file.  "shar" format can be unpacked on a lot of
different systems other than just UNIX.  People these days are using your
programs in ways you never envisioned and on systems you never envisioned.
Even if a program is not really applicable to a particular environment,
there are often portions of the program that can be borrowed and used in
other applications.

4) With a "shar" format posting I can decide if something is useful before
I have all of the pieces.  If I then miss one or more pieces I can request
them from somewhere, knowing that they are useful to me.  With a uuencoded
tar file I need to have all of the pieces before I can decide if it is
really useful to me.  I know some people will say "but I precede the posting
with a description of what it is", but this is not good enough.  Unless the
description you post exactly matches the description I have been thinking
of for something that I want, I can't really tell if this will be useful
to me.  There is no substitute for reading through all of the documentation
supplied and reading through a good portion of the source code.  Besides
that, what if the one piece of your posting that I miss is the first one,
so that I never see your description of what it is?
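
A sketch of the two unpacking paths mentioned in point (2), with invented
file names:

	uudecode pkg.uu            # yields pkg.tar.Z
	uncompress pkg.tar.Z       # yields pkg.tar
	tar xf pkg.tar             # finally, the individual files

versus, for a shar:

	sh pkg.shar                # one step, and readable before unpacking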

In my opinion there are just too many arguments against posting uuencoded
tar files to even consider it as a viable alternative to shar files.  The
only reason I can see for uuencoding something is if it is a binary or if
it contains binary data.  Even then you should just uuencode that one item
and include it in a shar file with the plain text documentation.

Please do not post uuencoded tar files!  If you are concerned about your
program being modified as it is transmitted through BITNET then make sure
your source code is portable enough to withstand this.  You could also
try including checksums in your postings using the "snefru" package that
was posted recently.
-- 
David H. Brierley
Home: dave@galaxia.Newport.RI.US    {rayssd,xanth,att} !galaxia!dave
Work: dhb@quahog.ssd.ray.com        {uunet,sun,att,uiucdcs} !rayssd!dhb

overby@plains.UUCP (Glen Overby) (07/11/90)

In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) charges:
>We have recently seen a spate of "source" postings in "uuencoded
>compressed TAR" form, instead of SHAR or other traditional plain text
>formats.

to which <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) confesses:
> I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm not
> entirely happy with it, but I believe there are some valid reasons for
> using this ugly format...

> When I transmit a file, I want it to be received unchanged.  If it must be
> translated to suit the receiver's environment, then that translation should
> be done explicitly by the reciever, not magically by some machine halfway
> between here & there.

Both Steve and I are Frequent Flamers on comp.os.minix.  This group
is gatewayed to a LISTSERV list on that bastion of computer networks,
Bitnet.  I think we've all heard the rhetoric about what Bitnet does to
source files, but if not just ask any one of us who have been unfortunate
enough to have once been a BITNaut.

Every time someone posts a source file which is not uuencoded, they get
flamed by a dozen BITNauts who feel ripped off for not having gotten a good
copy.

But in article <1990Jul10.203015.27282@eci386.uucp> woods@eci386.UUCP (Greg A. Woods)
claims:
> [...]  It is rare
>to find sites mangling news which is only passing through these days.
>The translation usually occurs either during the storing of news, or
>in the retrieval by the newsreader.

I recall five parts of a Minix upgrade being munged last Christmas Eve (yes,
1989) between vu.nl and nodak.edu, and I think all of its path was over the
Internet.

The rationalization for compressing is to compensate for the expansion
caused by uuencoding, whose rationalization is, in an acronym, BITNET.
Fix the Bitnet problem, and you rid the world of most of the reasons for
uuencoding.

I offer one suggestion: for groups which are source-only, have the gateway
program pump everything thru 'compress | uuencode' before feeding it to
Listserv.  I still see no solution for discussion groups which also get
sources posted to them.
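
In pipeline form, a sketch of what the gateway might do for each
source-only article (the file names are invented):

	compress < article | uuencode article.Z > outbound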

While I'm indicting comp.os.minix, I'd like to also charge comp.binaries.*
with a similar offense, using arc, zip or zoo instead of compress.

Other Solutions, anyone?
-- 
		Glen Overby	<overby@plains.nodak.edu>
	uunet!plains!overby (UUCP)  overby@plains (Bitnet)

lyndon@cs.AthabascaU.CA (Lyndon Nerenberg) (07/12/90)

In article <4187@taux01.nsc.com> amos@taux01.nsc.com (Amos Shapir) writes:

>The point was, what happens to those who use the BITNET/EBCDIC machines?
>The fact is, they are more numerous than UNIX users.

Do you have any hard data to back that statement up?

-- 
     Lyndon Nerenberg  VE6BBM / Computing Services / Athabasca University
         {alberta,cbmvax,mips}!atha!lyndon || lyndon@cs.athabascau.ca
                           Practice Safe Government
                                 Use Kingdoms

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (07/12/90)

In article <987@galaxia.Newport.RI.US> dave@galaxia.Newport.RI.US (David H. Brierley) writes:

>  If you are concerned about your
>program being modified as it is transmitted through BITNET then make sure
>your source code is portable enough to withstand this.

This isn't easy, especially if you are distributing patches. This is
why comp.os.minix distributes patches as .tar.Z.uu's.

--
"Perhaps I'm commenting a bit cynically, but I think I'm qualified to."
                                              - Dan Bernstein

roy@cs.umn.edu (Roy M. Silvernail) (07/12/90)

overby@plains.UUCP (Glen Overby) writes:

> While I'm indicting comp.os.minix, I'd like to also charge comp.binaries.*
> with a similar offense, using arc, zip or zoo instead of compress.

In defense of c.b.ibm.pc, the use of arc/zip/zoo is appropriate, since
these are widely available archivers for MS-DOS platforms. The contents
of c.b.i.p are binaries _for_ MS-DOS platforms. Frequently, these
postings are collections of files, rather than single files. To use
compress, you would also have to devise a way to assemble multi-file
postings, and they would _still_ have to be uuencoded. (shars of uuencoded
compressed files? yuck!) Certainly, compress for MS-DOS machines is
available, but arc/zip/zoo are more appropriate. Also, all 3 formats can
be unpacked on a Unix box (and arc and zoo files can be assembled, as
well), so non-DOS types can peek at things they cannot use.

There is a distinct difference between source postings and binary
postings. Binaries should properly be packed in a manner appropriate to
the target platform, and sources should be left as transparent as
possible (i.e. shars).

Just my $0.022, adjusted for inflation.
--
    Roy M. Silvernail   |   "It won't work... I have an  | Opinions found
    now available at:   |   exceptionally large mind."   | herein are mine,
 cybrspc!roy@cs.umn.edu | --Marvin, the paranoid android | but you can rent
(cyberspace... be here!)|                                | them.

flee@guardian.cs.psu.edu (Felix Lee) (07/12/90)

On a different note,

	A group I worked in made the surprising discovery that
	uuencode, a utility traditionally used to convert binary files
	to a printable form to pass through mailers, is a utility to
	"encode a binary file into a different binary file."

	[Randall Howard <rand@mks.com>
	 usenet <620@longway.TIC.COM>
	 comp.std.unix, 4 Apr 1990]

Sending uuencoded files through BITNET is by no means safe.  One
common munge is stripping trailing blanks.  This, at least, is
relatively easy to recover from.
--
Felix Lee	flee@cs.psu.edu

jim@anacom1.UUCP (Jim Bacon) (07/13/90)

In article <5256@plains.UUCP> overby@plains.UUCP (Glen Overby) writes:
>In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) charges:
>>We have recently seen a spate of "source" postings in "uuencoded
>>compressed TAR" form, instead of SHAR or other traditional plain text
>>formats.
>
[stuff deleted]
>
>to which <3114@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) confesses:
>> I'm certainly guilty of posting articles in *.tar.Z.uue format.  I'm not
>> entirely happy with it, but I believe there are some valid reasons for
>> using this ugly format...
>
[stuff deleted]
>
>I offer one suggestion: for groups which are source-only, have the gateway
>program pump everything thru 'compress | uuencode' before feeding it to
>Listserv.  I still see no solution for discussion groups which also get
>sources posted to them.
>
>While I'm indicting comp.os.minix, I'd like to also charge comp.binaries.*
>with a similar offense, using arc, zip or zoo instead of compress.
>
>Other Solutions, anyone?

I have been involved with FIDOnet on MSDOS for a good number of
years and have suffered the impact of changing "standards" for
compression methods.

At the start, ARC was the standard.  Then PKARC came along.  That wasn't
much of a problem, but did cause some confusion.  Then the lawsuits
started flying and we were buried under a slew of new programs, ZIP,
ZOO, and a half dozen others.

Now, I never know what to expect thru the network and about half of my
mail gets lost because I have taken the position that ARC is the
standard on my machine.

I would strongly urge that only a single compression utility be used as
a standard, and for UN*X I would suggest compress.


-- 
Jim Bacon                            | "A computer's attention span is only
Anacom General Corp., CA             |  as long as its extension cord."
jim@anacom1.cpd.com                  |
zardoz!anacom1!jim                   |                                 Anon

kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) (07/13/90)

In article <5256@plains.UUCP> overby@plains.UUCP (Glen Overby) writes:
>In article <15652@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) charges:
>>We have recently seen a spate of "source" postings in "uuencoded
>>compressed TAR" form, instead of SHAR or other traditional plain text
>>formats.
>
>Every time someone posts a source file which is not uuencoded, they get
>flamed by a dozen BITNauts who feel ripped off for not having gotten a good
>copy.
>
>
>I offer one suggestion: for groups which are source-only, have the gateway
>program pump everything thru 'compress | uuencode' before feeding it to
>Listserv.  I still see no solution for discussion groups which also get
>sources posted to them.
>
>Other Solutions, anyone?

Here's an idea: Let's compromise!  Come up with a format that really works!

We should be able to come up with a protocol that combines the safety of
uuencoding with the readability of shar archives.  Some features I would
like to see in the ultimate USENET archive format are:

1) The archive should be plain-text.  That is, each text file in the archive
   should be easy to locate within the archive, and it should be readable
   without the need to extract it.

2) The format would only be used to combine several text files into a single
   text file.  If you really must include a non-text file, then uuencode
   that one file.

3) Archives should begin with a table of all printable ASCII characters,
   so we can tell when transliteration has gone awry.

4) The archive program should split long lines when the archive is created,
   and rejoin them during extraction.

5) Tabs should be expanded to spaces.  The extraction program should convert
   groups of spaces back into tabs.

6) The program that creates the archive should give a warning message when
   a file's whitespace is likely to be reformatted.  For example, spaces at
   the end of a line are a no-no.

7) The extraction program should be clever enough to ignore news headers and
   other introductory text, just for the sake of convenience.

8) It should be possible to embed one archive inside another.  This ability
   probably wouldn't see much use, but lack of the ability could sure be a
   nasty surprise to somebody.  "What?  You mean it only works on *some*
   text files?"

9) Should we use trigraphs for some of the more troublesome ASCII characters?
   The extraction utility could convert them back into real characters.

Did I miss anything?  Did I get anything wrong?  Does anybody know of an
existing format that comes close to these specs?
-------------------------------------------------------------------------------
Steve Kirkendall    kirkenda@cs.pdx.edu    uunet!tektronix!psueea!eecs!kirkenda

thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (07/13/90)

dave@galaxia.Newport.RI.US (David H. Brierley) writes:
>In article <1990Jul10.182546.26487@diku.dk> thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) writes:
>>	name	       size		crummy ASCII graphics
>>	----------  -------		---------------------
>>	tar	    4718592	tar	 ------- -60.3% ------>	tar.Z
>>	tar.Z	    1874378	+37.8%				 +37.8%
>>	tar.Z.uu.Z  2229065	tar.uu.Z -------  -6.8% ------>	tar.Z.uu.Z

>1) The compressed-uuencoded-compressed file is almost 20% larger than the
>compressed file, therefore you have *increased* my phone bills by 20%.  I
>do not exactly appreciate this.

1) As I wrote, IF you have to post uuencoded material, it should
probably be compressed first. I also wrote that I agree with all the
other reasons the original poster gave to AVOID posting uuencoded
stuff.

I'm not advocating that people waste your bandwidth by uuencoding
stuff, I'm trying to prevent a mistaken argument from making people
always post uuencoded stuff non-compressed --- because that often uses
even more bandwidth, and almost always uses much more disk space.

Compressing before uuencoding often saves 60% on disk and 5-10% on the
wire --- but sometimes it will only save ~5% on disk and _waste_ ~20%
on a compressed link (some Sun run-length-encoded rasterfiles behave
that way). The poster should try to find out how each of his files
behaves, and pack each of them in the cheapest way; as ``bandwidth on
compressed links'' seems to be the most popular cost metric, cheapest
probably means ``smallest after compression''. And then make a shar
archive of the packed files, so people can decide which they want to
unpack.
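
A sketch of that per-file test, for a hypothetical binary file image.ras
(the trailing compress approximates the cost on a compressed link):

	uuencode image.ras image.ras | compress | wc -c
	compress < image.ras | uuencode image.ras.Z | compress | wc -c

Post whichever form gives the smaller count, wrapped with the plain-text
files into an ordinary shar.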

Another problem with this: The result of compressing a single file may
be very misleading when we really want to know how much larger it
makes a compressed batch of news articles. Compression with compress is
very stateful, and in a given batch it may not be able to compress a
uuencoded file nearly as much as when taken alone. So even the worst
rasterfile example may not affect the size of a batch as much as the
numbers lead one to believe. (Normally, compress gets ~13% after any
uuencode; in these examples, it gets ~30% after uuencode, but only the
usual ~13% after compress-uuencode. In the middle of a batch, the
difference might shrink a lot --- possibly to the point where
compress-uuencode wins again because it starts out 5% smaller.)


2) I hope you realize that a tar archive has binary file headers and
cannot be posted without some sort of encoding, so your 20% are not
immediately applicable. However, anybody who uuencodes something which
would have got through news as well without encoding deserves your
scorn and anger (and in my opinion, this includes anybody who posts a
tar archive consisting of ASCII files).

And I don't understand why ASCII/EBCDIC problems should be an excuse
for uuencode, either. The format uses the ASCII characters '!', '['
and ']', which are among those I've most often seen altered in
ASCII->EBCDIC->ASCII translations. If a uuencoded file gets through
unscathed, odds are that any printable ASCII file would. But maybe
somebody wrote a uudecode which takes input in EBCDIC and outputs in
ASCII?

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk

tale@cs.rpi.edu (David C Lawrence) (07/13/90)

In <3124@psueea.UUCP> kirkenda@eecs.cs.pdx.edu (Steve Kirkendall) writes:

   5) Tabs should be expanded to spaces.  The extraction program should convert
      groups of spaces back into tabs.

... only if that is whence they came.  I use tabs extremely rarely.
Hate 'em, in fact.  You'd have a different copy of the sources
if you just changed all groups of spaces back to tabs based on some
pre-conceived notion of what a tab width is ("8 spaces" is not always
the right answer).  In some cases this could be VERY important if you
did it inside some literal that was important to the code.  (In the
cases of patches it is not as important because patch(1) does have a
flag to ignore these sorts of differences when checking to see that an
update is right, and you could always warn people to use it.)
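
For reference, the flag in question is presumably patch's -l, which tells
it to ignore whitespace when matching context; "update.diff" below is just
a stand-in name:

  patch -l < update.diff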

   6) The program that creates the archive should give a warning message when
      a file's whitespace is likely to be reformatted.  For example, spaces at
      the end of a line are a no-no.

I don't think this adequately addresses the above concern.
-- 
   (setq mail '("tale@cs.rpi.edu" "tale@ai.mit.edu" "tale@rpitsmts.bitnet"))

woods@robohack.UUCP (Greg A. Woods) (07/13/90)

In article <5256@plains.UUCP> overby@plains.UUCP (Glen Overby) writes:
> The rationalization for compressing is to compensate for the expansion
> caused by uuencoding, whose rationalization is, in an acronym, BITNET.
> Fix the Bitnet problem, and you rid the world of most of the reasons for
> uuencoding.

I would suggest that the BITNauts (as you so aptly called them!),
since they are the most common complainers, and the ones who are
affected first, should be the best ones to lobby the powers that be
in the BITNET towers of power.

It shouldn't really be that hard.  I'd bet the biggest problem is the
lack of co-ordination between versions of software on each end of the
link.  A little co-operative administration, and we wouldn't be
discussing all this nonsense!
-- 
						Greg A. Woods

woods@{robohack,gate,eci386,tmsoft,ontmoh}.UUCP
+1 416 443-1734 [h]   +1 416 595-5425 [w]   VE3-TCP   Toronto, Ontario; CANADA

bengtl@maths.lth.se (Bengt Larsson) (07/13/90)

In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:

(a list of valid points concerning a "shar" program)

>Did I miss anything?  Did I get anything wrong?  Does anybody know of an
>existing format that comes close to these specs?

Hmmm, one way to do it would be to write a little "unpacker" program (in C),
and distribute it with the archive (in plain text).

Suggested format for archive: (borrowed heavily from VMS_SHAR, the "shar"
program for VMS)

(unpacker program (optional) in plain text here. Let's call it "unpacker.c".
 For those who don't have it, extract it from here, compile it with 
 "cc -o unpacker unpacker.c" and start unpacking)

-- start part 1 --
file packer.txt 744 23642334
X The filename is on a line started by "file", followed by one space,
X followed by the filename, a space, a (Unix) protection code (octal, 
X like for "uuencode") and a checksum. The filename must not contain a space.
X 
X The archived file is mostly normal text. Control characters are escaped 
X with a backtick followed by three characters with the decimal (octal?) 
X value of the escaped character. Like `009 for a tab. The backtick 
X is itself escaped like `096.
X 
X Long lines are folded. Normal lines start with an "X". Continuation
V lines (like this one) start with a "V" (that is, newlines are to be skipped
V before "V").
X
X Since all lines start with a special character, it is possible to
X archive archives (the archived file ends with a line not starting with
X "X" or "V").
X
X Trailing blanks are escaped, just like control characters. 
X Trailing blanks which result from splitting a long line are also
X escaped. When run through the unpacker, all trailing spaces are 
X stripped first (trailing blanks may have been added somewhere).
X
X This is a line with some trailing spaces...        `032
-- end part 1 --

Anything may come here (News headers, for example).

We start the next part with a line which starts with "-- start part 2".
Note that the headers etc. may be in the middle of a file. All parts
in the archive have the same length. Archived files are split 
routinely between parts.

All the unpacker has to do is to look for a line starting with 
"-- end part xx" and then skip to a line beginning with "-- start part xx+1".
The unpacker may (should?) check that the "xx" numbers are correct and
in sequence.

-- start part 2 --
X 
X Now we can say something about directories. Let's start a new file
X "subdir.txt" in a subdirectory "doc".
directory doc 744
file doc/subdir.txt 744 2353453
X 
X Now we are in the subdirectory. A directory is created by a line
X started by "directory". The subdirectory may already exist (that is no
X error). Anyway, the protection code is specified like for files.
X 
X When files in a subdirectory are specified, directory parts are separated
X by "/" (like in Unix). This should make it possible to write unpackers
X for other environments (for example VMS).
X
X Let's say that the archive should be terminated with a line
X "end archive".
X
-- end part 2 --
end archive

The unpacking program could be run like:

  % unpacker prog.pck.01 prog.pck.02 prog.pck.03 ...
  
  or (Unix)
  
  % cat prog.pck.?? | unpacker

What do you think? The idea was that the "packer" program may be somewhat
complex, but the "unpacker" should be small (could be distributed with
the archive in plain text). The "packer" could accept lots of options
(for example, which characters to escape, the maximum line length, the 
maximum part size, maybe maximum length for filenames etc.). Reasonable
defaults should be provided. 
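
To give a rough idea of the scale, here is a sketch of the unpacking side
in awk rather than the proposed C (a sketch only: checksums, protection
codes, directories, trailing-blank stripping and the "start/end part"
bookkeeping are all left out):

#!/bin/sh
# miniature unpacker sketch: "file" lines name the output file, "X" lines
# start new lines, "V" lines continue the previous one, and `NNN is a
# decimal character escape (e.g. `009 = tab, `096 = backtick)
awk '
function decode(s,    r, i, code) {
        r = ""
        while ((i = index(s, "`")) > 0) {
                code = substr(s, i + 1, 3) + 0
                r = r substr(s, 1, i - 1) sprintf("%c", code)
                s = substr(s, i + 4)
        }
        return r s
}
function flushline() { if (have) { printf "%s\n", buf > out; have = 0 } }
/^file /       { flushline(); out = $2; next }
/^X/ && out    { flushline(); buf = decode(substr($0, 2)); have = 1; next }
/^V/ && out    { buf = buf decode(substr($0, 2)); next }
/^end archive/ { flushline(); out = "" }
END            { flushline() }
' "$@"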

I think the "packer" should default to the "safest" format (escaping
tabs and special characters for Bitnet). If the escaping mechanism
is turned off, this is just a file splitter/extractor (may be used
to split uuencoded GIF files, for example :-)

Bengt Larsson.
-- 
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET:    TYCHE::BENGT_L

gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) (07/13/90)

In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:

>Here's an idea: Let's compromise!  Come up with a format that really works!

Ok, let's set a goal: the files should unpack totally unchanged.

>3) Archives should begin with a table of all printable ASCII characters,
>   so we can tell when transliteration has gone awry.

This is good, although if two characters are mapped into one, the
process has failed. Formats like btoa avoid this because not many
computers map several English alphabetic characters into one. 

>5) Tabs should be expanded to spaces.  The extraction program should convert
>   groups of spaces back into tabs.

How can you do this and preserve the original file? What if the
original file has a bunch of spaces in a row?

>6) The program that creates the archive should give a warning message when
>   a file's whitespace is likely to be reformatted.  For example, spaces at
>   the end of a line are a no-no.

They may be a no-no, but what if you're transmitting context diffs on
files which used to have excess spaces... you could presumably think
up other situations in which you'd want trailing spaces to be
preserved.

>9) Should we use trigraphs for some of the more troublesome ASCII characters?
>   The extraction utility could convert them back into real characters.

You'd definitely need some way of signaling special characters. Then
you could mark tabs and fake ends-of-lines in order to prevent spaces
from getting eaten. But then we'd have to figure out every single
special character that is at risk for being munged -- {} [] $ |\

By the time you get done, it might be rather hard to read.

--
"Perhaps I'm commenting a bit cynically, but I think I'm qualified to."
                                              - Dan Bernstein

tp@mccall.com (07/13/90)

In article <1990Jul13.022224.25441@lth.se>, bengtl@maths.lth.se (Bengt Larsson) writes:
> In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
> 
> (a list of valid points concerning a "shar" program)
> 
>>Did I miss anything?  Did I get anything wrong?  Does anybody know of an
>>existing format that comes close to these specs?
> 
> Hmmm, one way to do it would be to write a little "unpacker" program (in C),
> and distribute it with the archive (in plain text).

It had better be a VERY portable program!

> Suggested format for archive: (borrowed heavily from VMS_SHAR, the "shar"
> program for VMS)

I agree that the VMS_SHARE format is quite good. However, the VMS_SHARE
format is self unpacking on VMS, just like a unix shar is on unix, with no
tools that are not part of the OS. The VMS_SHARE is a DCL command procedure
that contains a TPU program (TPU is the VMS programmable text editor) to
unpack the files. 

The problem with a C program is that it is VERY hard to write a portable
program with no #ifdef's that will do the job. If you go this route, write
it strictly as a filter, and invoke it just like you do sed in current
shar's, with the <<'EOF' input specifier and the > redirection for output.
And for heaven's sake, ONLY WRITE ONE OF THEM, and use only one name for
it! Whatever you write, I have to recognize explicitly (I maintain a VMS
unshar program, and I'm not interested in making it into a full Bourne
shell.)

Perhaps this is overkill? Wouldn't it be possible to escape the most
troublesome characters in such a way that you could still use sed to unpack
it? Anyone currently unpacking unix shar's has already emulated sed to some
degree; adding a few more substitute commands couldn't be hard. I don't
advocate using AWK: while there is a VMS version, it is large and not widely
installed. I suspect MSDOS or Amiga sites would have similar problems.

Final note about the other proposed format, DON'T mung spaces into tabs.
Some people don't use tabs. The goal should be to do as good a job as
possible in reproducing the exact file that was packed.
-- 
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA

chip@tct.uucp (Chip Salzenberg) (07/13/90)

According to overby@plains.UUCP (Glen Overby):
>Every time someone posts a source file which is not uuencoded, they get
>flamed by a dozen BITNauts who feel ripped off for not having gotten a good
>copy.

As far as I'm concerned, if they can't be troubled to translate ASCII
without lossage, munged sources are their own problem.
-- 
Chip, the new t.b answer man      <chip@tct.uucp>, <uunet!ateng!tct!chip>

new@udel.EDU (Darren New) (07/13/90)

This sounds good.  As long as we are playing with "standard" formats,
we might want to consider a new version of UUENCODE that avoids
any characters that are not available in EBCDIC.  I have one that
uses only +-*/ 0-9 A-Z a-z and has checksums on each line.

Other than that, the new shar format looks good.
I would suggest making restrictions on the characters that
can appear in filenames (like 14 chars or less, no spaces,
colons, backslashes, [, ], only one period, no semicolons,
and so on).  These could be enforced by default in the
packer and turned off only after dire warnings.

Incidentally, why Unix octal protection bits?  Why not
  RWED (for read, write, execute, delete) 
or some other non-Unix semantics? If it's going to a
Unix machine, a shell script could be put in the archive 
to properly set the protections. If you need particular
UNIX protections, you probably also need particular owners
and groups and such information would be nice to know for
those not running under UNIX.    -- Darren

peter@ficc.ferranti.com (Peter da Silva) (07/13/90)

In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
> Here's an idea: Let's compromise!  Come up with a format that really works!

I've suggested this before... the software tools format.

> 1) The archive should be plain-text.  That is, each text file in the archive
>    should be easy to locate within the archive, and it should be readable
>    without the need to extract it.

Headers and tailers are marked by "-h-" and "-t-". Other sequences could
be added, like "-d-" for directories.

> 2) The format would only be used to combine several text files into a single
>    text file.  If you really must include a non-text file, then uuencode
>    that one file.

Exactly.

> 3) Archives should begin with a table of all printable ASCII characters,
>    so we can tell when transliteration has gone awry.

That's a nice enhancement.

> 4) The archive program should split long lines when the archive is created,
>    and rejoin them during extraction.

Not currently supported, but see below.

> 5) Tabs should be expanded to spaces.  The extraction program should convert
>    groups of spaces back into tabs.

No. Tabs should be converted to a unique escape sequence.

> 6) The program that creates the archive should give a warning message when
>    a file's whitespace is likely to be reformated.  For example, spaces at
>    the end of a line are a no-no.

No, spaces at the end of a line should be marked.

> 7) The extraction program should be clever enough to ignore news headers and
>    other introductory text, just for the sake of convenience.

Anything not between "-h-" and "-t-" can be safely ignored.

> 8) It should be possible to embed one archive inside another.  This ability
>    probably wouldn't see much use, but lack of the ability could sure be a
>    nasty surprise to somebody.  "What?  You mean it only works on *some*
>    text files?"

Leading dashes are escaped with another dash.

> 9) Should we use trigraphs for some of the more troublesome ASCII characters?
>    The extraction utility could convert them back into real characters.

Yes, but not trigraphs. A two-character sequence should be enough... how
about "@x" for some value of x? @t would be tab, @! would be |, and so on.
Of course "@@" would be "@".

Begin *all* lines between -h- and -t- with X, or C if it's a continuation
of the previous line. Trailing spaces would have a "@" appended. (Of course,
some other escape character could be used... Kernighan and Pike just happen
to use "@" for other Software Tools programs.)
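
For illustration, here is a rough sketch of just the "@" escaping (a sketch
only, not the actual Software Tools code; the name "atpack" is invented):

#!/bin/sh
# atpack: "@" is the escape character -- "@t" stands for a tab, "@@" for
# a literal "@".  Usage: atpack pack|unpack < infile > outfile
tab=`echo x | awk '{printf "\t"}'`
case "$1" in
pack)
        # escape "@" first so the "@t" sequences we write are not re-escaped
        sed -e 's/@/@@/g' -e "s/$tab/@t/g"
        ;;
unpack)
        # two global sed substitutions cannot undo this reliably (consider
        # "@@t", which encodes a literal "@t"), so decode in a single
        # left-to-right pass with awk
        awk '{
                out = ""
                for (i = 1; i <= length($0); i++) {
                        c = substr($0, i, 1)
                        if (c == "@") {
                                i = i + 1
                                c = substr($0, i, 1)
                                if (c == "t") c = "\t"
                        }
                        out = out c
                }
                print out
        }'
        ;;
*)
        echo "usage: $0 pack|unpack" >&2
        exit 1
        ;;
esac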

Or how about this: begin each line with T for text, C for continued text,
and M for uuencoded lines?
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.
<peter@ficc.ferranti.com>

darcy@druid.uucp (D'Arcy J.M. Cain) (07/13/90)

In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Here's an idea: Let's compromise!  Come up with a format that really works!
>
Good idea but I think it involves fixing things in different places rather
than trying to come up with a new format.  No one is going to switch to
a completely new format.

>1) The archive should be plain-text.  That is, each text file in the archive
>   should be easy to locate within the archive, and it should be readable
>   without the need to extract it.
Already covered by shar

>2) The format would only be used to combine several text files into a single
>   text file.  If you really must include a non-text file, then uuencode
>   that one file.
Shar again

>3) Archives should begin with a table of all printable ASCII characters,
>   so we can tell when transliteration has gone awry.
Shar can be modified to put something in comments.

>4) The archive program should split long lines when the archive is created,
>   and rejoin them during extraction.
Modify shar to do this.  All it would take is to put a '\' if the line goes
beyond a certain point and continue on the next line.  The shell will put it
back together properly.

>5) Tabs should be expanded to spaces.  The extraction program should convert
>   groups of spaces back into tabs.
Bad idea.  What if the original file had no tabs in it?  The extraction program
would change multiple spaces to tabs thus doing what you are trying to avoid.

>6) The program that creates the archive should give a warning message when
>   a file's whitespace is likely to be reformatted.  For example, spaces at
>   the end of a line are a no-no.
Could be added to shar if desired.

>7) The extraction program should be clever enough to ignore news headers and
>   other introductory text, just for the sake of convenience.
Nice addition but it is also nice to be able to use the shell which everyone
has.  Non Unix boxes that emulate the work of the shell with an extraction
tool can add this if desired and a preprocessor can be added to Unix.  I
don't see this as being a major 'wannit' though.

>8) It should be possible to embed one archive inside another.  This ability
>   probably wouldn't see much use, but lack of the ability could sure be a
>   nasty surprise to somebody.  "What?  You mean it only works on *some*
>   text files?"
I have never tried it but I don't believe this is a problem with shar.

>9) Should we use trigraphs for some of the more troublesome ASCII characters?
>   The extraction utility could convert them back into real characters.
Again shar can be banged on a little to handle this.  Simply change the
sed command that it generates to translate trigraphs to the proper character.
Then shar can convert all troublesome characters and it will be converted back
when the script is run.  Create a temporary sed script at the start of the
script and use it for all the files.  That way sites that have problems
with some characters can modify the script before running it so that they
stay as trigraphs.  This has the added benefit of doing automatic trigraph
conversion at sites that require it.  Is there a trigraph for tabs?  If not
just invent one and you can handle the tab problem as well.
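
For concreteness, the unpacking half of such an archive might look something
like this fragment (a sketch only: tri.sed, hello.c and the EOF names are
invented here, and the escape sequences are simply the ANSI C trigraphs):

# temporary sed script, written once at the top of the unpacking script
cat > tri.sed << 'TRI_EOF'
s/??</{/g
s/??>/}/g
s/??!/|/g
s/??-/~/g
TRI_EOF
# every file in the archive is then unpacked through it
sed -e 's/^X//' -f tri.sed << 'SHAR_EOF' > 'hello.c'
Xint main() ??< return 0; ??>
SHAR_EOF
rm -f tri.sed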

>Did I miss anything?  Did I get anything wrong?  Does anybody know of an
>existing format that comes close to these specs?
shar. :-)

if people think there are some good ideas here I am willing to do some
hacking on shar/unshar to implement some of this stuff.

-- 
D'Arcy J.M. Cain (darcy@druid)     |   Government:
D'Arcy Cain Consulting             |   Organized crime with an attitude
West Hill, Ontario, Canada         |
(416) 281-6094                     |

jejones@mcrware.UUCP (James Jones) (07/13/90)

In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>Here's an idea: Let's compromise!  Come up with a format that really works!
>
>We should be able to come up with a protocol that combines the safety of
>uuencoding with the readability of shar archives.  Some features I would
>like to see in the ultimate USENET archive format are:
[list of constraints edited out]

Actually, Eric Tilenius, former moderator of the BITNET CoCo mailing list,
came up with a fairly reasonable way of encoding files so that they could
survive the ASCII-EBCDIC meatgrinder.  It doesn't bother letters, digits,
and spaces, so that (1) shar files fed through it are still pretty readable
and (2) compression on the output would still pay off.  This format, which
goes by the name CUTS (dunno what that's an acronym for), might be a usable
one.  If there's interest (express it via email, please!), I will go digging
for the description.

	James Jones

peter@ficc.ferranti.com (Peter da Silva) (07/14/90)

In article <24445@estelle.udel.EDU> new@ee.udel.edu (Darren New) writes:
> Incidentally, why Unix octal protection bits?  Why not
>   RWED (for read, write, execute, delete) 

(you mean Read Write Extend Delete)

Why not? This is Usenet, and any other set of bits is going to be equally
bogus. How about RWDHSAEX (Read Write Delete Hidden Script Archived Extend
Execute) while you're about it? About the only bits worth keeping are
write-protect and script/execute. Anything else is subject to local policy.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.
<peter@ficc.ferranti.com>

rsalz@bbn.com (Rich Salz) (07/14/90)

In <4187@taux01.nsc.com> amos@taux01.nsc.com (Amos Shapir) writes:
>The point was, what happens to those who use the BITNET/EBCDIC machines?
>The fact is, they are more numerous than UNIX users.

Prove it.

	/rich $alz
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

rsalz@bbn.com (Rich Salz) (07/14/90)

Make the best use of the READER's time.  Posting uuencode'd compressed sources
means that nice little BBS site might get a cheaper phone bill.  (Note the
word MIGHT!)  On the other hand, 20 people will now be running uudecode
and uncompress.

The alternative -- posting clean human-readable C code -- means people just
spend a little bit of time in their newsreading, reading code.
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

bengtl@maths.lth.se (Bengt Larsson) (07/14/90)

In article <3119.269d97ea@mccall.com> tp@mccall.com writes:

>I agree that the VMS_SHARE format is quite good. However, the VMS_SHARE
>format is self unpacking on VMS, just like a unix shar is on unix, with no
>tools that are not part of the OS. The VMS_SHARE is a DCL command procedure
>that contains a TPU program (TPU is the VMS programmable text editor) to
>unpack the files. 

Yes, that's the big advantage (having TPU around as a standard (although
there were some problems with the "standard" from VMS 4.x to VMS 5.x!)).

I must confess that I'm not much of a C-programmer, so I really can't say
if the "unpacker" could be written portably. It just seemed to me that
except for sed, awk, sh etc. the most portable thing on Unix would be a 
small C program. But, as I said, I'm no expert on portable C.

>The problem with a C program, is that it is VERY hard to write a portable
>program with no #ifdef's that will do the job. If you go this route, write
>it strictly as a filter, and invoke it just like you do sed in current
>shar's, with the <<'EOF' input specifier and the > redirection for output.
>And for heaven's sake, ONLY WRITE ONE OF THEM, and use only one name for
>it! Whatever you write, I have to recognize explicitly (I maintain a VMS
>unshar program, and I'm not interested in making it into a full Bourne
>shell.)

Hmmm, I can imagine the problems with writing an "unshar" for VMS
(I'm more familiar with VMS than Unix).

As you say, the most important thing is that whatever we use is a _standard_.
A rigidly standardized version of "shar" would do as well.

And maybe it would be useful to use the <<'EOF' method for feeding parts
to the unpacker. Not so pretty, though :-)

>Perhaps this is overkill? Wouldn't it be possible to escape the most
>troublesome characters in such a way that you could still use sed to unpack
>it? Anyone currently unpacking unix shar's has already emulated sed to some
>degree, adding a few more substitute commands couldn't be hard. I don't 
>advocate using AWK, while there is a
>VMS version, it is large and not widely installed. I suspect MSDOS or Amiga
>sites would have similar problems. 

Maybe it is overkill. But what about folded long lines? Can that be unpacked
with sed? Substituting the most important characters (for example tab) would
be doable in sed, I think.

>Final note about the other proposed format, DON'T mung spaces into tabs.

Agreed. Tabs should be preserved.

Anyway, I hope my proposed format gave some food for thought, especially for
Unix people. It would be much more portable to different systems than
the current versions of "shar". VMS_SHARE is certainly something to be
inspired by.


Summary of features in VMS_SHARE not present in "shar"s (at least not all 
of them):

  1. Escaping of all characters which are a) not printable ascii, b)
     likely to be munged by Bitnet.
  
  2. Folding of long lines. 
  
  3. Automatic skipping of News headers and such.
  
  4. Checksums as standard (This uses the verb CHECKSUM which comes with
     VMS).
     
  5. Archived files are routinely split between archive parts, to keep each
     part a standard size. This of course also handles files bigger than
     any of the archive parts.


Features of my proposed format (advantages relative to "shar"):

  1. Much more portable to different architectures (not just Unix).
  
  2. It's easy to find the file names, since they are on lines starting
     with "file".
  
  3. A standard checksum built in (maybe not a CRC, but something
     more powerful than a character count).
  
  4. Much more protection against character munging through character
     escapes.
  
  5. Automatic skipping of News headers and such when unpacking.
  
  6. Handles splitting of files between archive parts routinely. Handles
     archiving of files bigger than any archive part.


Disadvantages relative to "shar":

  1. Slightly less readable, especially if many characters are escaped.
  
  2. You must have an "unpacker" compiled (it may be distributed with
     the archive).

I'm sorry that I'm not much of a C programmer: I will not be implementing 
this myself.

Bengt Larsson.
-- 
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET:    TYCHE::BENGT_L

bengtl@maths.lth.se (Bengt Larsson) (07/14/90)

In article <24445@estelle.udel.EDU> new@ee.udel.edu (Darren New) writes:

>Other than that, the new shar format looks good.
>I would suggest making restrictions on the characters that
>can appear in filenames (like 14 chars or less, no spaces,
>colons, backslashes, [, ], only one period, no semicolons,
>and so on).  These could be enforced by default in the
>packer and turned off only after dire warnings.

Hmmm, this might be useful. I think a warning would do: "Warning:
non-portable filename, more than 14 chars ("some-long-file-name")".

It seems psychologically right. Nobody likes to be given a lot of
warnings, even if the program accepts the input.

>Incidentally, why Unix octal protection bits?  Why not
>  RWED (for read, write, execute, delete) 

Well, I borrowed them from "uuencode", that's why. The format
is primarily Unix-based, and it shows.

I'm not sure it was such a good idea with protection bits, though.
Maybe they don't belong there (I may not want the expanded files to
be publicly readable, for example).

>or some other non-Unix semantics? If it's going to a
>Unix machine, a shell script could be put in the archive 
>to properly set the protections. If you need particular
>UNIX protections, you probably also need particular owners
>and groups and such information would be nice to know for
>those not running under UNIX.    -- Darren

I don't think it's common to use owners and groups and such when
unpacking a (text) archive. I just thought it might be useful to
be able to make a file executable (like the "Configure" file in 
Larry Wall's programs). That's mainly why I included the protection bits.

Bengt Larsson.
-- 
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET:    TYCHE::BENGT_L

wht@n4hgf.Mt-Park.GA.US (Warren Tucker) (07/15/90)

In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:

> if people think there are some good ideas here I am willing to do some
> hacking on shar/unshar to implement some of this stuff.

Trust me :-), your mailbox will overflow with hate mail.  Fortunately
the scores of "letter bombs" I got when I hacked on shar a while
back were not real or my house would be gone and my yard would be
a large crater.

However, change is mandated once per century or so.  So, at the risk
of losing my street number to a black hole, I'm posting shar 3.31.

flames > /dev/h2o
 
--------------------------------------------------------------------
Warren Tucker, TuckerWare emory!n4hgf!wht or wht@n4hgf.Mt-Park.GA.US
Sforzando (It., sfohr-tsahn'-doh).   A direction to perform the tone
or chord with special stress, or marked and sudden emphasis.

plocher@sally.Sun.COM (John Plocher) (07/16/90)

darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>>9) Should we use trigraphs for some of the more troublesome ASCII characters?
>>   The extraction utility could convert them back into real characters.
>Again shar can be banged on a little to handle this.  Simply change the
>sed command that it generates to translate trigraphs to the proper character.
>Then shar can convert all troublesome characters and it will be converted back
>when the script is run.


Ok, so I generate a new-shar archive that looks like this:

	sed "s/???/}/" << EOF
	blah
	EOF

And I send it thru a machine that munges the '}' character.
The orig file won't extract correctly now, even with this
"smart" sed script tacked on, because now the sed script itself
is broken.  In fact, sed-based trigraph translators can corrupt
"correct" text like this:

	Remind:  Should this get fixed ???
...and...
	/* How many cards are there ??*/
...etc...

  -John

tp@mccall.com (07/16/90)

In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
> In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>>4) The archive program should split long lines when the archive is created,
>>   and rejoin them during extraction.
> Modify shar to do this.  All it would take is to put a '\' if the line goes
> beyond a certain point and continue on the next line.  The shell will put it
> back together properly.

Technical question from one who maintains a VMS unshar util. Is this true?
What happens if you have a line of text in the actual file that ends with a
'\', which is common, for instance, in long #defines in C source? Will
current shar/unshar programs fold this into a single line? If not, what is
the difference with what you said above? 

Do any shar utilities out there do this? Anyone want to guess if supporting
it will cause more problems than it solves (assuming the conflict in the
previous paragraph)?
-- 
Terry Poot <tp@mccall.com>                The McCall Pattern Company
(uucp: ...!rutgers!ksuvax1!mccall!tp)     615 McCall Road
(800)255-2762, in KS (913)776-4041        Manhattan, KS 66502, USA

drd@siia.mv.com (David Dick) (07/16/90)

In <373@anacom1.UUCP> jim@anacom1.UUCP (Jim Bacon) writes:


>I would strongly urge that only a single compression utility be used as
>a standard, and for UN*X I would suggest compress.

It has been reported that Unisys owns a patent on Lempel-Ziv or
LZW compression (I don't remember which) and believes every current 
use of compress is in violation!

I think they are taking steps to deal with this.

So, if you think "compress" will avoid the legal morass that arose
in the ARC world, you may be in for a surprise.

David Dick
Software Innovations, Inc.  [the Software Moving Company (sm)]

leilabd@syma.sussex.ac.uk (Leila Burrell-Davis) (07/16/90)

In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:  
>Here's an idea: Let's compromise!  Come up with a format that really works!
>  
>We should be able to come up with a protocol that combines the safety of  
>uuencoding with the readability of shar archives.

No one has yet mentioned Brad Templeton's abe, which was designed to
solve these problems, but has never achieved widespread usage.

I enclose the read.me from the package:
-----------------------------------------------------------------------

	ABE Ascii-Binary Encoding System by B. Templeton

ABE is a replacement for uuencode/uudecode designed to deal with all
the typical problems of USENET transmission, along with those of other
media.

Advantages are:
	Files are often smaller, and compress well.

	All printable characters map to themselves, so strings in
		binaries are readable right in the encoding.

	All lines are indexed, so sort(1) can repair any random
		scrambling of lines or files. (This can be turned off.)

	Extraneous lines (news headers, comments, signatures etc.) are
		ignored, even in the middle of encodings.

	A PD tiny decoder is available to include with files for first
		time users.

	Files can be split up automatically into equal sized blocks.

	Blocks can contain redundant information so that the decoder
		can handle blocks in any order, even with reposted duplicates
		and extraneous articles.

	Files with blank regions can be constructed from multi-part encodings
		with damaged blocks.

	Multiple files can be placed in one encoding.

	The decoder is extremely general and configurable, and supports many
	features not currently found in the encoder, but which other encoder
	writers might find useful.

In general, a redundant ABE encoding posted to a typical newsgroup over a
certain article region can be decoded with something as simple as:
	
	dabe /usr/spool/news/comp/binaries/group/3[45]?

Where it doesn't matter much if there are postings in a random order,
duplicate postings, or inserted articles on other topics.   Ie. exactly
all the things that are a pain about usenet (or mail) binaries.
(You can usually run dabe right on your entire mailbox.)


The ABE encoder (and decoder) support 3 different encoding formats.  One
uses all 94 printable ASCII characters, the other avoids characters that
have trouble in ASCII-EBCDIC translations, and the 3rd is the UUENCODE
format.  (ABE can make files decodable by a typical uudecode program.)
-----------------
-- 
Leila Burrell-Davis, Computing Service, University of Sussex, Brighton, UK
Tel:   +44 273 678390              Fax:   +44 273 678470
Email: leilabd@syma.sussex.ac.uk  (JANET: leilabd@uk.ac.sussex.syma)

karl@haddock.ima.isc.com (Karl Heuer) (07/17/90)

In article <3131.26a188ca@mccall.com> tp@mccall.com writes:
>In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>> Modify shar to [split long lines].  All it would take is to put a '\' if
>> the line goes beyond a certain point and continue on the next line.  The
>> shell will put it back together properly.
>
>Technical question ... Is this true?

No.  Since shar'd files are quoted (via the hack of quoting the word that
identifies the terminator for the here-document), a backslash is not special
there.  You'd have to use an unquoted here-document, and then escape any real
backslashes, dollar signs, or grave accents that appear in the text.
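
A quick way to see the difference (the file names and delimiters below are
made up for the test):

#!/bin/sh
# With the delimiter quoted, the trailing backslash and the newline reach
# the output file untouched; unquoted, the shell removes backslash-newline
# (and would also expand any $ or ` in the text).
cat << 'QUOTED_EOF' > quoted.out
#define TWO_LINES \
        continued
QUOTED_EOF
cat << UNQUOTED_EOF > unquoted.out
#define TWO_LINES \
        continued
UNQUOTED_EOF
wc -l quoted.out unquoted.out    # quoted.out: 2 lines, unquoted.out: 1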

Karl W. Z. Heuer (karl@kelp.ima.isc.com or ima!kelp!karl), The Walking Lint

mykel@saleven.oz (Michael Landers) (07/17/90)

In article <1990Jul13.022224.25441@lth.se> bengtl@maths.lth.se (Bengt Larsson) writes:
>Hmmm, one way to do it would be to write a little "unpacker" program (in C),
>and distribute it with the archive (in plain text).

Although it doesn't come up to the illustrious specs mentioned in the
above article, if you have read the Obfuscated C Contest winners,
there is a neat little unpacker written in something like 1000 characters that
is an equivalent to "atob | zcat".  This way we can use btoa'd compress'd tars
and no one will even notice...

Mykel.
-- 
 ()				      \\     Black Wind always follows
|\/|ykel Landers  (mykel@saleven.oz)   \\    Where by dark horse rides,
_||_                                    \\   Fire is in my soul,
Phone: +612 906 3833 Fax: +612 906 2537  \\  Steel is by my side.

bxw@ccadfa.adfa.oz.au (Brad Willcott) (07/17/90)

With all of the gripes going on about this subject, I'm afraid that, as the
system I am using is UNIX, and it has compress, uncompress, tar, shar,
unshar, uudecode, uuencode, and MOST important of all, 'nn', I don't
have any problems with receiving these sorts of files, provided of course
that there is a decent description attached to the posting header.
NN is the news reader/poster that I use.  The user interface is quite nice,
considering that I am only using VT100 terminals.  It allows me to EASILY
unshar or decode a posting into its relevant files, in ANY directory I wish.
It will even create the directory for me if it doesn't already exist.

So I would suggest that everyone who might be having trouble get a copy
of these utilities, and 'nn'; then we can finish this discussion here!

PS:  An additional suggestion would be to install LHARC on your systems;
then we can send/receive even smaller files :-)

-- 
Brad Willcott,                          ACSnet:     bxw@ccadfa.cc.adfa.oz
Computing Services,                     Internet:   bxw@ccadfa.cc.adfa.oz.au
Australian Defence Force Academy,       UUCP:!uunet!munnari.oz.au!ccadfa.oz!bxw
Northcott Dr. Campbell ACT Australia 2600  +61 6 268 8584  +61 6 268 8150 (Fax)

woods@robohack.UUCP (Greg A. Woods) (07/17/90)

In article <138944@sun.Eng.Sun.COM> plocher@sally.Sun.COM (John Plocher) writes:
> darcy@druid.uucp (D'Arcy J.M. Cain) writes:
> >Again shar can be banged on a little to handle this.  Simply change the
> >sed command that it generates to translate trigraphs to the proper character.
> >Then shar can convert all troublesome characters and it will be converted back
> >when the script is run.
> 
> Ok, so I generate a new-shar archive that looks like this:
>[blahhh....] 
> And I send it thru a machine that munges the '}' character.
> The orig file won't extract correctly now, even with this
> "smart" sed script tacked on, because now the sed script itself
> is broken.

But at least you only have to fix one [set of] line(s) which you know
all about, instead of a million possible unknown points in a source
file which got munged.

Too bad about the "extra" trigraph in the source.  Perhaps shar could
check for these, and certainly the sender could pack and unpack before
sending to validate the correctness.
-- 
						Greg A. Woods

woods@{robohack,gate,eci386,tmsoft,ontmoh}.UUCP
+1 416 443-1734 [h]   +1 416 595-5425 [w]   VE3-TCP   Toronto, Ontario; CANADA

gert@targon.UUCP (Gert Kanis) (07/17/90)

In article <24445@estelle.udel.EDU> new@ee.udel.edu (Darren New) writes:
>we might want to consider a new version of UUENCODE that avoids
>any characters that are not available in EBCDIC.  I have one that
>uses only +-*/ 0-9 A-Z a-z and has checksums on each line.
>
>-- Darren

You might be referring to xxencode (and xxdecode). This has been posted at
least a few times (I think).
Of course the reactions to yet another encoder that expands your files more
than uuencode are very unenthusiastic.
+------------------------------------------------------------------+
| No quote here to     | Gert Kanis, SWP BS                        |
| save net-bandwidth.  | Nixdorf Computer BV, Postbus 29           |
|----------------------| 4130 EA Vianen, Netherlands.              |
| I do not represent   | E-mail:{smart-mailer!} gert@targon.uucp   |
| anyone elses opinion.|   or {..uunet!} hp4nl.nluug.nl!targon!gert|
+------------------------------------------------------------------+

darcy@druid.uucp (D'Arcy J.M. Cain) (07/17/90)

In article <3131.26a188ca@mccall.com> tp@mccall.com writes:
>In article <1990Jul13.161441.8339@druid.uucp>, darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>> In article <3124@psueea.UUCP> kirkenda@eecs.UUCP (Steve Kirkendall) writes:
>>>4) The archive program should split long lines when the archive is created,
>>>   and rejoin them during extraction.
>> Modify shar to do this.  All it would take is to put a '\' if the line goes
>> beyond a certain point and continue on the next line.  The shell will put it
>> back together properly.
>
>Technical question from one who maintains a VMS unshar util. Is this true?
>What happens if you have a line of text in the actual file that ends with a
>'\', which is common, for instance, in long #defines in C source? Will
>current shar/unshar programs fold this into a single line? If not, what is
>the difference with what you said above? 
>
You're right.  I usually try things out before posting statements like that
but I missed that one.  The text between "<< SHAR_EOF" and the end of input
text is of course not interpreted by the shell like the commands are.  The
shar program would need a little more work than I suggested above.

I am going to post my genfiles utility.  This is a pair of programs that I
use for generating ASCII files on multiple systems.  I will modify it to
include support for breaking long lines before posting it.  Perhaps the net
would like to look at this as a starting point for some kind of standard
distribution method.  I only have access to Unix and MSDOS so I can't say
if it is suitable for other systems.  It does allow stuff to be transmitted
in clear readable text.  Since I expect I will get many suggestions for
improvements should I post to both alt.sources and comp.sources or should
I hold the c.s posting till I have a more finished product?

-- 
D'Arcy J.M. Cain (darcy@druid)     |   Government:
D'Arcy Cain Consulting             |   Organized crime with an attitude
West Hill, Ontario, Canada         |
(416) 281-6094                     |

thorinn@skinfaxe.diku.dk (Lars Henrik Mathiesen) (07/18/90)

drd@siia.mv.com (David Dick) writes:
>It has been reported that Unisys owns a patent on Lempel-Ziv or
>LZW compression (I don't remember which) and believes every current 
>use of compress is in violation!

I think I have seen the paper describing that work. It described the
use of some sort of Lempel-Ziv encoding between a mainframe and a disk
subsystem --- the processor being so fast that it could compress one
block while the previous was being transferred, thus increasing
throughput.
  A patent based on that would probably be a process patent covering
the use of a specific algorithm (12 bit compress) in a particular part
of the I/O path in a disk system. If UniSys think it covers the use of
a PD compress program on a general purpose computer system, either
they really got a patent for the algorithm (which they oughtn't be able
to) or they have misunderstood something.
  On the other hand, I think the W in LZW may be the author of that
paper, which might give them a claim on that version of the algorithm.

>I think they are taking steps to deal with this.

I hope those steps don't work.

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn@diku.dk

mills@ccu.umanitoba.ca (Gary Mills) (07/18/90)

Expanding tabs in a shar is no good because it messes up context
diffs.  Encoding them is a better idea.  I use the following shar
from time to time because my Bitnet gateway (via UREP) expands tabs:

----------------8<---------------------8<-------------------8<--------
#!/bin/sh
# tshar: shell archiver that makes tabs visible
#
echo '#! /bin/sh'
echo '# This is a shell archive, meaning:'
echo '# 1. Remove everything above the #! /bin/sh line.'
echo '# 2. Save the resulting text in a file.'
echo '# 3. Execute the file with /bin/sh (not csh) to create:'
for file in "$@"
do
	echo "#	$file"
done
echo "# This archive created: `date`"
logname=${LOGNAME:-"unknown"}
fullname=`grep "^$logname:" /etc/passwd | cut -d: -f5`
echo "# By:	$logname ($fullname)"
echo 'export PATH; PATH=/bin:/usr/bin:$PATH'
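# emit code so the unpacking script can rebuild a literal tab without one
# appearing in the archive itself (where it might get expanded in transit),
# and build a tab here for the packing sed below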
echo "tab=\`echo \" \" | awk '{printf \"\\\t\\\n\"}'\`"
tab=`echo " " | awk '{printf "\t\n"}'`
for file in "$@"
do
	size=`wc -c < $file`
	echo "echo shar: \"extracting '$file'\" '($size characters)'"
	echo "if test -f '$file'"
	echo "then"
	echo "	echo shar: \"will not over-write existing file '$file'\""
	echo "else"
	echo "	sed -e 's/^X//' -e \"s/\^I/\$tab/g\" << \\SHAR_EOF > '$file'"
	sed -e 's/^/X/' -e "s/$tab/\^I/g" $file
	echo 'SHAR_EOF'
	echo "if test $size -ne \"\`wc -c < '$file'\`\""
	echo "then"
	echo "	echo shar: \"error transmitting '$file'\" '(should have been $size characters)'"
	echo "fi"
	echo "fi"
done
echo "exit 0"
echo '#	End of shell archive'
#


-- 
-Gary Mills-             -University of Manitoba-             -Winnipeg-

new@udel.EDU (Darren New) (07/18/90)

In article <1425@targon.UUCP> gert@targon.UUCP (Gert Kanis) writes:
>You might be refering to xxencode (and xxdecode). This has been posted at
>least a few times (I think).

Actually, I was talking about my own format. It does more than just
encode and decode, but this group was not discussing that. -- Darren

new@udel.EDU (Darren New) (07/19/90)

In article <3087@syma.sussex.ac.uk> leilabd@syma.sussex.ac.uk (Leila Burrell-Davis) writes:
>No one has yet mentioned Brad Templeton's abe, which was designed to
>solve these problems, but has never achieved widespread usage.

Possibly because the read.me does not say where to get it? :-)
		  -- Darren