[alt.sources.d] A readable, robust encoding for source postings

barrett@Daisy.EE.UND.AC.ZA (Alan P. Barrett) (12/29/90)

In article <MEISSNER.90Dec28123513@curley.osf.org>, meissner@osf.org
(Michael Meissner) writes:
> On the other hand, ever since I've switched to compress + uuencode +
> shar for shipping out large sets of patches, I've had fewer people
> complain about news/mail trashing the file or subsequent patches
> failing, since some 'helpful' intermediary decided to put a newline in
> column 79, or change tabs to spaces, or....

I sympathise, having been on the receiving end of a feed that changed
tabs to spaces, or appended extra spaces to the ends of some lines, or
inserted extra newlines, or broke messages into lots of little pieces
without labelling the parts properly.  (This is no reflection on the
people who provided that feed -- they did their best.)

I think that the correct way to fix this is to use an encoding that is
both readable and robust.  A version of shar that does stuff like
encoding tabs as \t and wrapping lines in a reversible way would do it.
In fact, there was a lot of discussion on this topic here several months
ago.  Sorry, I don't remember details, but I thought that somebody was
going to do some real work on coming up with a suitable standard?
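
As a rough illustration of what "encoding tabs as \t and wrapping lines in a
reversible way" could look like, here is a minimal Python sketch.  It is not
any existing shar variant: the escape set (\\, \t, \s), the trailing-backslash
continuation marker, and the names encode_line and decode_lines are all
assumptions made for this example.

```python
# Hypothetical escapes: backslash -> \\, tab -> \t, trailing space -> \s.
ESCAPES = {"\\": "\\\\", "\t": "\\t"}
UNESCAPES = {"\\": "\\", "t": "\t", "s": " "}


def encode_line(line, width=72):
    """Return the physical lines (each at most `width` chars) for one line."""
    tokens = [ESCAPES.get(ch, ch) for ch in line]
    # Protect trailing spaces so that a feed which strips or appends
    # spaces at line ends cannot change the decoded text.
    i = len(tokens) - 1
    while i >= 0 and tokens[i] == " ":
        tokens[i] = "\\s"
        i -= 1
    out, current = [], ""
    for tok in tokens:
        # Reserve one column for the continuation backslash and never
        # split a two-character escape across physical lines.
        if len(current) + len(tok) > width - 1:
            out.append(current + "\\")
            current = ""
        current += tok
    out.append(current)
    return out


def decode_lines(physical):
    """Rebuild the original lines from encoded physical lines."""
    logical, current = [], []
    for raw in physical:
        text = raw.rstrip(" \n")  # spaces added in transit are dropped
        i, continued = 0, False
        while i < len(text):
            ch = text[i]
            if ch != "\\":
                current.append(ch)
                i += 1
            elif i + 1 == len(text):
                continued = True  # lone backslash at end: line continues
                i += 1
            else:
                current.append(UNESCAPES[text[i + 1]])
                i += 2
        if not continued:
            logical.append("".join(current))
            current = []
    return logical
```

Because a genuine trailing space is always written as \s, the decoder can
discard any spaces that an intermediary appends to the ends of lines.  The
sketch does nothing about inserted newlines or split messages.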

--apb
Alan Barrett, Dept. of Electronic Eng., Univ. of Natal, Durban, South Africa
Internet: barrett@ee.und.ac.za (or %ee.und.ac.za@saqqara.cis.ohio-state.edu)
UUCP: m2xenix!quagga!undeed!barrett    PSI-Mail: PSI%(6550)13601353::BARRETT

allbery@NCoast.ORG (Brandon S. Allbery KB8JRR) (12/30/90)

As quoted from <1990Dec29.114801.5895@Daisy.EE.UND.AC.ZA> by barrett@Daisy.EE.UND.AC.ZA (Alan P. Barrett):
+---------------
| I think that the correct way to fix this is to use an encoding that is
| both readable and robust.  A version of shar that does stuff like
| encoding tabs as \t and wrapping lines in a reversible way would do it.
| In fact, there was a lot of discussion on this topic here several months
| ago.  Sorry, I don't remember details, but I thought that somebody was
| going to do some real work on coming up with a suitable standard?
+---------------

Brad Templeton's ABE is in the comp.sources.misc archives.  It is a robust,
readable encoding that includes line-by-line checksums and mapping of
characters that don't survive EBCDIC translations (and others).  I would
personally consider starting from there, as Brad decided not to make it
shareware/commercial in part as a possible solution to this issue.
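
To show the general idea of line-by-line checksums (and only the idea: this is
NOT ABE's actual format), here is a small Python sketch.  The two-hex-digit
prefix, the byte-sum checksum, and the function names are assumptions chosen
for illustration.

```python
def checksum(text):
    """Sum of the bytes of the line, modulo 256 (a deliberately weak check)."""
    return sum(text.encode("utf-8")) % 256


def add_checksums(lines):
    """Prefix each line with a hypothetical 2-hex-digit checksum and a space."""
    return [f"{checksum(line):02x} {line}" for line in lines]


def verify(lines):
    """Return the 1-based numbers of lines whose checksum does not match."""
    bad = []
    for number, line in enumerate(lines, start=1):
        tag, _, body = line.partition(" ")
        try:
            expected = int(tag, 16)
        except ValueError:
            bad.append(number)
            continue
        if len(tag) != 2 or checksum(body) != expected:
            bad.append(number)
    return bad
```

A per-line check like this lets the receiver name the damaged line rather
than just rejecting the whole file.  A plain byte sum will miss some damage
(transposed characters, for instance), so a real format would want something
stronger.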

++Brandon
-- 
Me: Brandon S. Allbery			    VHF/UHF: KB8JRR on 220, 2m, 440
Internet: allbery@NCoast.ORG		    Packet: KB8JRR @ WA8BXN
America OnLine: KB8JRR			    AMPR: KB8JRR.AmPR.ORG [44.70.4.88]
uunet!usenet.ins.cwru.edu!ncoast!allbery    Delphi: ALLBERY

darcy@druid.uucp (D'Arcy J.M. Cain) (12/31/90)

In article <1990Dec29.114801.5895@Daisy.EE.UND.AC.ZA> Alan P. Barrett writes:
> [...]
>I think that the correct way to fix this is to use an encoding that is
>both readable and robust.  A version of shar that does stuff like
>encoding tabs as \t and wrapping lines in a reversible way would do it.
>In fact, there was a lot of discussion on this topic here several months
>ago.  Sorry, I don't remember details, but I thought that somebody was
>going to do some real work on coming up with a suitable standard?

I posted my genfiles program, which I hoped would be a jumping-off point for
such an effort.  Has anyone looked at it, and does anyone have suggestions to
enhance the protocols I suggested?

Here is the Readme from the distribution:

-------------------------------------------------------------------------
This is my file generation utility.  The genfiles program reads in a
script from the standard input and creates files based on the contents.
There is some parameter substitution as well.  The mkscript program is
an easy way of creating the scripts used by genfiles.  See the source
files for further details.

These programs are being offered as a possible solution to problems of
transferring files between different networks without changing them.  The
utilities in this distribution were originally written for different
purposes and have been hacked on in order to make a start on some sort of
solution.  Some of the issues addressed (and hopefully solved) are:

    The lines can be split if desired and restored on the receiving end.

    Many troublesome characters are translated to less troublesome ones
    and restored on the receiving end.

    The files are transmitted in a form that can be read without any
    further processing.  This is ***NOT*** a uuencoding type program.

    By modifying the code, systems that need some characters converted
    to trigraphs can do so by simply commenting out the case statement
    that converts the troublesome character(s).

What it doesn't have yet is multi-part support other than splitting
up the resulting file and restoring it by hand.  I will try to do
something about this.

It also doesn't unpack itself like shar files do but the program is
fairly simple and can easily be written for systems that can't use
this one for some reason.

In order to use the program that creates the script file you either
have to pick up my getarg program which I recently posted or else
hack the source to use normal getopt.  If you can't get getarg from
a local archive site you can get it from my machine's mail server.
Send mail to unix-server@druid.UUCP with the following line in the
body of the message:

send getarg.c

Note that if the mail to the server gets too heavy I will have to shut it
down for my neighbours' sake, so please use it as a last resort.  This
is just a lowly leaf node with a single 2400 baud modem.

The program to unpack the files is self contained.

D'Arcy J.M. Cain
D'Arcy Cain Consulting
West Hill, Ontario
darcy@druid.UUCP
---------------------------------------------------------------------
-- 
D'Arcy J.M. Cain (darcy@druid)     |
D'Arcy Cain Consulting             |   There's no government
West Hill, Ontario, Canada         |   like no government!
+1 416 281 6094                    |

rhys@batserver.cs.uq.oz.au (Rhys Weatherley) (12/31/90)

In <1990Dec30.170302.21665@druid.uucp> darcy@druid.uucp (D'Arcy J.M. Cain) writes:

>In article <1990Dec29.114801.5895@Daisy.EE.UND.AC.ZA> Alan P. Barrett writes:
>> [...]
>>I think that the correct way to fix this is to use an encoding that is
>>both readable and robust.  A version of shar that does stuff like
>>encoding tabs as \t and wrapping lines in a reversible way would do it.
>
>I posted my genfiles program, which I hoped would be a jumping-off point for
>such an effort.  Has anyone looked at it, and does anyone have suggestions to
>enhance the protocols I suggested?

I missed the original discussion, so I may be repeating things, but
the central problem I think there will be in getting a new transmission
standard off the ground is actually making it a standard :-).  unshar,
uuencode and the like are very widespread, and trying to shake their
ground may be very hard.  Maybe in the interim a cut-down "encoder"
is needed that can be wrapped-up in a shar archive, and will be unpacked,
compiled and run to unpack the rest.  e.g. the shar archive could look
something like this:

		... head information ...
		sed ... >/tmp/decode.c <<EOF
		... source code for decode.c ...
		EOF
		cc -o /tmp/decode /tmp/decode.c
		sed ... <<EOF | /tmp/decode >file
		... file contents ...
		EOF

It should be possible to get a very compact decoding program that could
be wrapped up with the shell archives.  Won't solve all the problems
but may help, as well as its being reasonably compatible with the
existing shar archiving system.  Well, that's my thoughts on the matter,
what do you think?

Rhys.

P.S. D'Arcy, could you tell us where your program may be found, since
     I missed it first time around.

+===============================+==================================+
||  Rhys Weatherley             |  The University of Queensland,  ||
||  rhys@batserver.cs.uq.oz.au  |  Australia.  G'day!!            ||
+===============================+==================================+

terry@galaxia.newport.ri.us (01/01/91)

As long as the discussion about packing source code has been reopened,
how about extending the discussion to include alternatives to shar?
In addition to USENET postings being routed to different networks with
some character incompatibilities, there are also now a large number of
non-unix machines connected to the network.  While these machines do not
have SH with which to unpack a shar posting, there are unshar programs that
will unpack some of the shar postings; I have a couple.  Note, I said some of
the postings.  The trouble is there is always a new version of shar being used
which breaks the old unpackers; mine cannot unpack the latest distributions.
Also, the source code for these unpackers is not widely distributed, which
makes it difficult to change it or for newcomers to obtain one.  Furthermore,
there has been some concern expressed about the security of using shar.
Therefore, I am suggesting that there be some serious discussion about using a
packing format, with distributed code as is done with uudecode, to replace
shar.

Comments anyone?

raymond!terry@galaxia.newport.ri.us
{rayssd,xanth,lazlo,mirror,att}!galaxia!raymond!terry

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (01/01/91)

rhys@batserver.cs.uq.oz.au writes:
> darcy@druid.uucp (D'Arcy J.M. Cain) writes:

>>In article <1990Dec29.114801.5895@Daisy.EE.UND.AC.ZA> Alan P. Barrett writes:
>>> [...]
>>>I think that the correct way to fix this is to use an encoding that is
>>>both readable and robust.  A version of shar that does stuff like
>>>encoding tabs as \t and wrapping lines in a reversible way would do it.

>>I posted my genfiles program, which I hoped would be a jumping-off point for
>>such an effort.  Has anyone looked at it, and does anyone have suggestions to
>>enhance the protocols I suggested?

>I missed the original discussion, so I may be repeating things, but
>the central problem I think there will be in getting a new transmission
>standard off the ground is actually making it a standard :-).  unshar,
>uuencode and the like are very widespread, and trying to shake their
>ground may be very hard.  Maybe in the interim a cut-down "encoder"
>is needed that can be wrapped-up in a shar archive, and will be unpacked,
>compiled and run to unpack the rest.  e.g. the shar archive could look
>something like this:

>		... head information ...
>		sed ... >/tmp/decode.c <<EOF
>		... source code for decode.c ...
>		EOF
>		cc -o /tmp/decode /tmp/decode.c
>		sed ... <<EOF | /tmp/decode >file
>		... file contents ...
>		EOF

>It should be possible to get a very compact decoding program that could
>be wrapped up with the shell archives.  Won't solve all the problems
>but may help, as well as its being reasonably compatible with the
>existing shar archiving system.  Well, that's my thoughts on the matter,
>what do you think?

Problem is, lots of shars are unpacked on systems where the C compiler
command isn't spelled "cc", and lots of shars don't contain C code and may
be unpacked on systems where, e.g., Modula-2 is the only compilable
language.  In fact, I unpack lots of shars on my Amiga, where "sed"
doesn't exist, and the "unshar" program fakes it by knowing the format
of ordinary shar file "sed" commands and doing what's right.

Probably, despite the calls here for clear text, a much more robust way
to transmit source files is the one used in, for example,
comp.binaries.ibm.pc, where the expected resources at a site are
"uudecode", which can be transmitted in clear text as a BASIC or C
program, and some widely available archiving program; the one of choice
now is zoo, but lharc is coming up fast due to a superior packing
algorithm.

Add to that the "brik" CRC check, the zoo internal CRC checks, and the
short line, limited character set, uuencode format with line by line
checksums, and you have an extremely robust encoding that can transit
ASCII to EBCDIC to ASCII intact, and doesn't challenge developmentally
disabled news software, which we will always have with us.
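
As a small illustration of the "short line, limited character set" property,
the following Python sketch uuencodes data 45 bytes at a time and checks the
round trip with a whole-file CRC.  It is an assumption-laden stand-in, not the
actual brik or zoo checks: zlib.crc32 is used simply because it is a readily
available CRC-32, and the helper names uu_lines and uu_decode are made up for
this example.  The backtick option needs Python 3.7 or later.

```python
import binascii
import zlib


def uu_lines(data):
    """Encode bytes as uuencode body lines (no begin/end framing)."""
    lines = []
    for start in range(0, len(data), 45):  # 45 bytes per line is the uu limit
        chunk = data[start:start + 45]
        # backtick=True writes zero as '`' instead of a space, so no line
        # can end in a space that a helpful intermediary might strip.
        encoded = binascii.b2a_uu(chunk, backtick=True)
        lines.append(encoded.decode("ascii").rstrip("\n"))
    return lines


def uu_decode(lines):
    """Decode uuencode body lines back to bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines)


def crc_matches(original, lines):
    """Compare a whole-file CRC-32 before and after the round trip."""
    return zlib.crc32(uu_decode(lines)) == zlib.crc32(original)
```

Each encoded line is at most 61 characters and uses only printable characters
from '!' through '`', which is why this form tends to survive gateways that
wrap long lines or retab text.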

The major requirement for this method is that there needs to be a very
explicit clear text explanation of the purpose and contents of the
archive to let the reader make a decision whether it is worth unpacking.

I'm not thrilled when I take the time to unpack and catenate and
uudecode an archive with an interesting description from the PC-clone
universe, only to find out that it doesn't contain the source code I was
seeking/expecting in hopes of stealing some code and ideas for a port
of the functionality.  A minimal description should state whether source is
included, plus data types, platforms, compiler technology required,
functionality, and copyright status.

To another poster's comments that folks on EBCDIC systems have to solve
their own character set and newline encoding problems, that misses the
point.  Lots of ASCII to ASCII routings these days arrive with a BITNET
host as an intermediary, so even the ASCII destination sites have to
be concerned about the problem of an encoding that can survive the
transit.

I think the current pleas to keep the comp.sources.{unix,games,misc} and
alt.sources postings all clear text, while understandable, are
misdirected on today's net.

And, again to another posting, no, the world is not all becoming USENet,
to live under our way of doing things, just because the nets are being
gatewayed together and sharing code in a much larger universe. The
greater net is a community of peer networks, each with its own peculiar
needs and requirements, not a set of subordinates to the least organized
and most contentious member of the set, USENet.

Thus it behooves us to find methods that cause as few problems as
possible in getting code across this wider universe of communication,
and clear text transmission doesn't seem to be the appropriate technique
anymore.

That's my opinion, but I pack and unpack a _lot_ of source: 0.6 gigabytes
compressed, at last count, not bad for a personal archive.  That translates
into several thousand archives of various sorts that I've unpacked.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

tneff@bfmny0.BFM.COM (Tom Neff) (01/01/91)

In article <62-raymond-terry> raymond!terry@galaxia.newport.ri.us (Terry Raymond) writes:
>The trouble is there is always a new version of shar being used which breaks
>the old unpackers; mine cannot unpack the latest distributions.

Yes indeedy, because there is always someone out there to gild the lily.
Witness the latest pointless "TOUCH=cannot" oddity.

What would be nice would be for someone to concoct a Perl or C program
which accepts a signature prefix to look for (/^X/ for instance) and
writes files from raw shars, regardless of format.
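
A minimal sketch of such a prefix-driven extractor, in Python rather than Perl
or C, might look like the following.  It is far from "regardless of format":
it assumes each file is introduced by one line containing both a ">" target
and a "<<" here-document delimiter, and the name extract_shar and the two
regular expressions are inventions for this example.  It never runs any shell
code from the archive.

```python
import re

# A here-document marker such as <<EOF, <<'EOF', << \EOF or <<"EOF".
HEREDOC = re.compile(r"<<\s*\\?['\"]?([A-Za-z0-9_.!-]+)")
# An output redirection such as >name or > 'name' (but not >> or <<).
TARGET = re.compile(r"(?<![<>])>\s*['\"]?([^\s'\"<>&|;]+)")


def extract_shar(text, prefix="X"):
    """Return {filename: contents} for the files found in a shar archive."""
    files = {}
    lines = iter(text.splitlines())
    for line in lines:
        here = HEREDOC.search(line)
        target = TARGET.search(line)
        if not (here and target):
            continue  # ordinary shell line: ignored, never executed
        delimiter, name = here.group(1), target.group(1)
        body = []
        for content in lines:  # consume lines up to the delimiter
            if content == delimiter:
                break
            if content.startswith(prefix):
                content = content[len(prefix):]  # drop the signature prefix
            body.append(content + "\n")
        files[name] = "".join(body)
    return files
```

The function returns the contents instead of writing files, so the caller can
refuse absolute paths or names containing ".." before anything touches the
disk.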

>I am suggesting that there be some serious discussion about using a packing
>format, with distributed code as is done with uudecode, to replace shar.

There are ASCII archives out there.  The problem is enforcing a
standard.

emv@ox.com (Ed Vielmetti) (01/01/91)

In article <1990Dec31.232624.23510@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

   The major requirement for this method is that there needs to be a very
   explicit clear text explanation of the purpose and contents of the
   archive to let the reader make a decision whether it is worth unpacking.

The text explanation should be in a separate article from the globs of
binary encoded stuff.  It should probably be cross-posted to a
relevant group, so that people who don't read the binary-encoded
sources group can be informed in a timely fashion of new stuff.  In
addition it should have a very clear and precise description of where
a reasonably fresh version can be FTP'd from, if the code is available
in that way; that will facilitate reposting (just) the announcement
into comp.archives.  

For that matter, a separate "this is what it is and where you can get
it" would be useful for any set of postings to alt.sources, sort of a
"part 0 of 15" which would be sent around more widely than the other
half a megabyte blortful.

--Ed
emv@ox.com

darcy@druid.uucp (D'Arcy J.M. Cain) (01/01/91)

In article <6540@uqcspe.cs.uq.oz.au> rhys@batserver.cs.uq.oz.au writes:
>In <1990Dec30.170302.21665@druid.uucp> darcy@druid.uucp (D'Arcy J.M. Cain) writes:
>>I posted my genfiles program, which I hoped would be a jumping-off point for
>>such an effort.  Has anyone looked at it, and does anyone have suggestions to
>>enhance the protocols I suggested?
>
>I missed the original discussion, so I may be repeating things, but
>the central problem I think there will be in getting a new transmission
>standard off the ground is actually making it a standard :-).  unshar,

True.

>uuencode and the like are very widespread, and trying to shake their
>ground may be very hard.  Maybe in the interim a cut-down "encoder"
>is needed that can be wrapped-up in a shar archive, and will be unpacked,
>compiled and run to unpack the rest.  e.g. the shar archive could look

This is still not universal.  It only works on Unix-like systems.  A
standard should operate under any OS.  That is why my system was made simple,
so that unpackers can be easily written for any platform.  Also note that
using shell to unpack can be a security hole.

[stuff deleted]

>
>P.S. D'Arcy, could you tell us where your program may be found, since
>     I missed it first time around.

I posted to alt.sources, so check with the local archive sites.  I was going
to post to comp.sources.misc once I got some feedback and fixed up any
problems people had with it but so far there have been very few suggestions
for fixing it up.  Naturally this means that the code is perfect and bug-free
and no fixes are necessary.  :-)  Actually I have been adding support for
multiple input files which was not present in the first version.  As the
shar is only 13634 bytes (13415 for a genfiles script) I could probably
post another interim version but I don't want to clutter up everyone's
archives unnecessarily so I will probably wait till I have done a few
more fixes and tested it.  In particular the program to unpack is almost
done but the program to create the scripts, while working, can use some
more enhancements.

-- 
D'Arcy J.M. Cain (darcy@druid)     |
D'Arcy Cain Consulting             |   There's no government
West Hill, Ontario, Canada         |   like no government!
+1 416 281 6094                    |

rhys@batserver.cs.uq.oz.au (Rhys Weatherley) (01/01/91)

In <1990Dec31.232624.23510@zorch.SF-Bay.ORG> xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:

>Problem is, lots of shars are unpacked on systems where the C compiler
>command isn't spelled "cc", and lots of shars don't contain C code and may
>be unpacked on systems where, e.g., Modula-2 is the only compilable
>language.  In fact, I unpack lots of shars on my Amiga, where "sed"
>doesn't exist, and the "unshar" program fakes it by knowing the format
>of ordinary shar file "sed" commands and doing what's right.

So much for that idea :-)

Then again, maybe we just need better co-ordination between the source
and binary groups and the FTP sites around the world, since they have
the best transmission method: store it the way it was meant to be!
How about the moderators refusing to post a program unless it has
first been submitted to one or more of the major FTP sites, and they
are obliged to list the FTP sites where it can be found in the
initial blurb about the program?

Rhys.

+===============================+==================================+
||  Rhys Weatherley             |  The University of Queensland,  ||
||  rhys@batserver.cs.uq.oz.au  |  Australia.  G'day!!            ||
+===============================+==================================+

barrett@Daisy.EE.UND.AC.ZA (Alan P. Barrett) (01/02/91)

In article <1990Dec31.232624.23510@zorch.SF-Bay.ORG>,
xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) writes:
> Probably, despite the calls here for clear text, a much more robust
> way to transmit source files is [some uuencoded binary format].
>
> The major requirement for this method is that there needs to be a very
> explicit clear text explanation of the purpose and contents of the
> archive to let the reader make a decision whether it is worth
> unpacking.

The lack of such descriptions in the postings that we do get is a big
argument in favour of readable source postings.  If the posting is
readable then we can at least look through it in an attempt to find out
what it does, and whether it relies on system features we do not have.
Even a short (say 30 line) description is frequently too little to allow
a decision to be made.

--apb
Alan Barrett, Dept. of Electronic Eng., Univ. of Natal, Durban, South Africa
Internet: barrett@ee.und.ac.za (or %ee.und.ac.za@saqqara.cis.ohio-state.edu)
UUCP: m2xenix!quagga!undeed!barrett    PSI-Mail: PSI%(6550)13601353::BARRETT

bengtl@maths.lth.se (Bengt Larsson) (01/05/91)

In article <75110375@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
>In article <62-raymond-terry> raymond!terry@galaxia.newport.ri.us (Terry Raymond) writes:

>>I am suggesting that there be some serious discussion about using a packing
>>format, with distributed code as is done with uudecode, to replace shar.
>
>There are ASCII archives out there.  The problem is enforcing a
>standard.

How about an RFC for a basic "shar" format? After all, there are RFCs for mail
digest formats and such. The RFC could define a pretty fixed format (easy to
parse for foreign "unshars"), and it just happens to unpack itself using
"sh".

I think the "standard shar" should contain the "sed" command with "X" starting
lines, the "wc" check for (a little) security, and (maybe) the check
for overwriting files. Plus "#" for comments, "echo" for messages, and 
"exit" to skip signatures at the end. That should do it, short and simple 
(and KISS!)
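
To make the list above concrete, here is a hedged Python sketch of a generator
for such a minimal shar.  The exact layout, the SHAR_EOF delimiter, the message
wording, and the name make_shar are assumptions for illustration, not a
proposed standard.  It assumes simple file names (no quotes or spaces) and
text that ends with a newline.

```python
def make_shar(files):
    """Build a minimal shell archive from a {name: text} mapping."""
    out = [
        "#!/bin/sh",
        "# This is a shell archive.  Unpack it by running it with sh.",
    ]
    for name, text in files.items():
        if not text.endswith("\n"):
            raise ValueError(f"{name}: text must end with a newline")
        size = len(text.encode("ascii"))  # byte count for the wc check
        out += [
            f"if test -f '{name}'; then",          # overwrite check
            f"  echo 'shar: will not overwrite {name}'",
            "else",
            f"  echo 'x - {name}'",                # progress message
            f"  sed 's/^X//' >'{name}' <<'SHAR_EOF'",
        ]
        # Every content line starts with X, so none can equal the delimiter.
        out += ["X" + line for line in text.splitlines()]
        out += [
            "SHAR_EOF",
            f"  test `wc -c <'{name}'` -eq {size} ||",
            f"    echo 'shar: {name} damaged (size check failed)'",
            "fi",
        ]
    out.append("exit 0")  # anything after this (a signature) is skipped
    return "\n".join(out) + "\n"
```

Keeping to this fixed shape means a foreign "unshar" only has to recognise one
"sed" line per file and strip the leading "X" until it sees the delimiter.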

Admittedly, the "shar" is Unix-centric, but it would be standardizing
existing practice (normally considered to be a Good Thing).

Comments?

Bengt L.

PS. I had a text archive format proposed last time this discussion was
    around. This would be more general than "shar". I suppose I could
    repost that, if there's any interest. 
DS.
-- 
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET:    TYCHE::BENGT_L