[net.micro] New way to post binaries

silvert@dalcs.UUCP (Bill Silvert) (10/18/86)

Here are some opinions about the problem of posting binaries, along with
a draft solution.  There should be some discussion on the net before it
gets implemented.

Sources are no substitute for binaries, since not everyone has the same
compiler, or even language, on micros.

Binaries have to be encoded as ASCII files.  But there is no reason why
we have to use uuencode!  There are evidently problems with it, and we
should feel free to invent an alternate encoding method which avoids the
problems with uuencode.  These problems, aside from the minor one that
uuencode is designed for the Unix environment, are that some characters
(such as curly braces {}) do not make it through all nodes unscathed
(IBM machines and others with EBCDIC codes appear to be the culprits),
and for long files the postings have to be combined in an editor.
Another problem is that uudecode is a complicated program which a lot of
users have trouble getting or rewriting.

I propose that we develop an encoding method for microcomputers that
meets these requirements:

> So simple that users can easily learn the protocol and write their own
version of the decoding program.  Uudecode is relatively easy to write
in C, but gets tricky in languages that do not have low-level bit
operations.

> Moderately compact, to keep the traffic volume down.

> Reasonably good error trapping to check for damaged files.

> Convenient to use, preferably not requiring the use of an editor even
for multi-part postings.

One possibility would be to post hex files, but these are very bulky, at
least twice as long as the binary being posted.  However, a
generalization of posting hex will work -- if we encounter the letter G
in a hex file we know it is an error, but we can also adopt the
convention that the letters G-Z do not have to be encoded, so that they
are represented by one byte in the encoded file instead of two.  This
can save a lot of space.  Based on this, here is my proposal:

	*** TO ENCODE A FILE ***

Read through the file a byte at a time, and classify each byte as
follows:

>OK, pass through unchanged

>TRANSFORM to a single byte

>ENCODE as a pair of bytes

The encoding I propose is a modified hex, using the letters A-P instead
of the usual hex 0-9A-F -- the reason for this is that it is trivial to
map this way, e.g., value = char - 'A'.  The rest of the upper case letters,
Q-Z, can be used for error checking and for 1-byte transformations of
common non-graphic bytes, such as NUL and NEWLINE.  Thus the actual
encoding rules could be:

>OK includes digits 0-9, lower case alphabet, and punctuation marks.

>TRANSFORM \0 -> Q, \r -> R, space -> S, \t -> T, etc.

>ENCODE all upper case letters and other characters into modified hex
codes, AA to PP.
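
The rules above can be sketched in a few lines of Python.  This is only
an illustration under stated assumptions, not a reference encoder: the
names TRANSFORM, OK and encode are made up here, only the four
transforms actually listed are implemented (the "etc." entries are left
out), and treating every ASCII punctuation mark as OK is a guess.

```python
import string

# The four single-byte transforms listed in the post; the "etc." is not filled in.
TRANSFORM = {0x00: "Q", 0x0D: "R", 0x20: "S", 0x09: "T"}

# Bytes passed through unchanged.  Using all of string.punctuation is an
# assumption; a real posting might exclude characters that gateways mangle.
OK = set((string.digits + string.ascii_lowercase + string.punctuation).encode("ascii"))


def encode(data: bytes) -> str:
    """Encode bytes with the proposed OK / TRANSFORM / modified-hex rules."""
    out = []
    for b in data:
        if b in OK:
            out.append(chr(b))                      # OK: unchanged
        elif b in TRANSFORM:
            out.append(TRANSFORM[b])                # TRANSFORM: one letter Q-T
        else:
            high, low = b >> 4, b & 0x0F            # ENCODE: two letters A-P
            out.append(chr(ord("A") + high) + chr(ord("A") + low))
    return "".join(out)


if __name__ == "__main__":
    print(encode(b"a\x00 Z"))
```

With these assumptions a newline has no transform of its own, so it
falls under "other characters" and comes out as the pair AK.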

I have done this encoding on a number of files using a crude set of
programs that I wrote a while back when I didn't have xmodem working on
my net machine and couldn't get uudecode working on my micro -- the
files were generally no larger than uuencoded files, often smaller.
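
How the sizes compare depends heavily on the mix of bytes in the file.
The sketch below estimates both, assuming the same byte classes as the
rules above (all ASCII punctuation passed through, four transforms) and
counting only the body lines of uuencode output: one length character,
four characters per three input bytes, and a newline for every 45 input
bytes.  The function names are hypothetical.

```python
import string

# Assumed byte classes, mirroring the rules given in the post.
PASS_THROUGH = set((string.digits + string.ascii_lowercase + string.punctuation).encode("ascii"))
ONE_BYTE = {0x00, 0x0D, 0x20, 0x09}   # \0, \r, space, \t


def modified_hex_size(data: bytes) -> int:
    """Encoded length: 1 character for OK/TRANSFORM bytes, 2 for the rest."""
    return sum(1 if (b in PASS_THROUGH or b in ONE_BYTE) else 2 for b in data)


def uuencode_size(data: bytes) -> int:
    """Length of the uuencode body lines only (no begin/end lines)."""
    total = 0
    for i in range(0, len(data), 45):
        chunk = min(45, len(data) - i)
        total += 1 + 4 * ((chunk + 2) // 3) + 1   # length char + groups + newline
    return total


if __name__ == "__main__":
    for sample in (b"hello world", b"\xff" * 45):
        print(len(sample), modified_hex_size(sample), uuencode_size(sample))
```

Text-like data with many lower case letters and spaces comes out
shorter than uuencode, while a run of bytes that all need hex pairs
comes out longer.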

To avoid very long lines, adopt the convention that white space is
ignored, so that you can put in newlines wherever you want (probably not
in the middle of a hex pair though).

To decode a file, one simply reverses the process.  Read through the
file a byte at a time, and use switch or a set of ifs to do the
following:

>letter A-P?  Read next byte and output 16*(first-'A') + (second - 'A')

>letter Q-Z?  Output \0, \r, etc., according to above table.

>anything else?  Output it as it stands (white space is skipped, as
noted above).
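
A decoder along these lines might look like the following Python
sketch.  The names UNTRANSFORM and decode are made up for illustration,
only the four listed transforms are known to it, and it treats an A-P
letter followed by anything other than A-P as an error, so it does not
understand the marker pairs discussed under REFINEMENTS.

```python
# Inverse of the four transforms listed in the post.
UNTRANSFORM = {"Q": 0x00, "R": 0x0D, "S": 0x20, "T": 0x09}


def decode(text: str) -> bytes:
    """Reverse the modified-hex encoding; white space is ignored."""
    out = bytearray()
    chars = (c for c in text if not c.isspace())
    for c in chars:
        if "A" <= c <= "P":
            # First half of a hex pair: pull the second half from the stream.
            second = next(chars, None)
            if second is None or not ("A" <= second <= "P"):
                raise ValueError("bad hex pair starting with %r" % c)
            out.append(16 * (ord(c) - ord("A")) + (ord(second) - ord("A")))
        elif c in UNTRANSFORM:
            out.append(UNTRANSFORM[c])
        elif "Q" <= c <= "Z":
            raise ValueError("letter %r has no transform assigned here" % c)
        else:
            out.append(ord(c))          # anything else is output as it stands
    return bytes(out)


if __name__ == "__main__":
    print(decode("aQSFK"))
```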

	*** REFINEMENTS ***

I haven't said anything yet about error checking, convenience, etc.
Note that there are several byte combinations that are not used in this
scheme of things, specifically a letter A-P followed by Q-Z.  These can
be used to add these features.  For example, an encoded file should
begin with the pair AZ and end with PZ, similar to the begin and end
lines used by uuencode.  However, we could also adopt the convention
that when a file is broken into parts, the first part ends with BZ, the
next begins with CZ, and so on.  This way one could simply decode a set
of files without first combining them -- the program would start at the
AZ flag, and stop when it found BZ.  Then it would go on to the next
file and search for CZ, etc.  If it didn't find PZ at the end of the
last file, or if the codes were out of order, it would complain.
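
The part-marker convention could be handled roughly as in this Python
sketch.  It assumes that part k begins with the k-th odd marker (AZ,
CZ, EZ, ...) and ends with the following one (BZ, DZ, ...), except that
the last part ends with PZ, which allows at most eight parts.  It also
assumes a marker is never split by white space and never occurs by
accident in the header text.  The name join_parts is hypothetical.

```python
def join_parts(parts):
    """Return the encoded payload of a multi-part posting, markers removed.

    Text before each begin marker (headers) and after each end marker
    (signatures) is skipped, so no editor is needed.
    """
    if not 1 <= len(parts) <= 8:
        raise ValueError("this marker scheme covers 1 to 8 parts")
    payload = []
    for i, text in enumerate(parts):
        begin = chr(ord("A") + 2 * i) + "Z"             # AZ, CZ, EZ, ...
        if i == len(parts) - 1:
            end = "PZ"                                   # last part
        else:
            end = chr(ord("A") + 2 * i + 1) + "Z"       # BZ, DZ, FZ, ...
        start = text.find(begin)
        if start < 0:
            raise ValueError("part %d: begin marker %s not found" % (i + 1, begin))
        stop = text.find(end, start + 2)
        if stop < 0:
            raise ValueError("part %d: end marker %s not found" % (i + 1, end))
        payload.append(text[start + 2:stop])
    return "".join(payload)


if __name__ == "__main__":
    parts = ["From: x\nAZabcBZ\n-- sig", "hdr\nCZdefPZ\nbye"]
    print(join_parts(parts))
```

The joined payload can then be handed to whatever decoder is in use.
Parts given out of order fail because the expected begin marker is not
found.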

Further refinements would be to add various checksums, set off by other
unused code pairs.  I'll pass on this one: it sounds like a good idea,
but it adds to the complication.  Perhaps it could be made optional,
such as writing a checksum after each termination code like BZ ... PZ.

If this idea seems reasonable, perhaps net moderators could carry the
ball from here.  Unfortunately this site is not very reliable for news
and mail.

thomps@gitpyr.gatech.EDU (Ken Thompson) (10/22/86)

In article <2035@dalcs.UUCP>, silvert@dalcs.UUCP (Bill Silvert) writes:
> Here are some opinions about the problem of posting binaries, along with
> a draft solution.  There should be some discussion on the net before it
> gets implemented.
> Sources are no substitute for binaries, since not everyone has the same
> compiler, or even language, on micros.
> Binaries have to be encoded as ASCII files.  But there is no reason why
> we have to use uuencode!  There are evidently problems with it, and we
> should feel free to invent an alternate encoding method which avoids the
> problems with uuencode.  These problems, aside from the minor one that
> uuencode is designed for the Unix environment, are that some characters
> (such as curly braces {}) do not make it through all nodes unscathed
> (IBM machines and others with EBCDIC codes appear to be the culprits),
> and for long files the postings have to be combined in an editor.
> Another problem is that uudecode is a complicated program which a lot of
> users have trouble getting or rewriting.
> I propose that we develop an encoding method for microcomputers that
> meets these requirements:
> > So simple that users can easily learn the protocol and write their own
> version of the decoding program.  Uudecode is relatively easy to write
> in C, but gets tricky in languages that do not have low-level bit
> operations.
> > Moderately compact, to keep the traffic volume down.
> > Reasonably good error trapping to check for damaged files.
> > Convenient to use, preferably not requiring the use of an editor even
> for multi-part postings.
> 
I, for one, am strongly opposed to trying to develop a new "standard"
decoding scheme. For one thing, it will be next to impossible to get
agreement from such a large group on what it should be. The effort will
cause more problems and confusion than we already have. 

I have never experienced the problems with curly braces, but it is possible,
I suppose, that we don't ever get messages that pass through EBCDIC machines
here. Certainly, I have never gotten any C source that had the curly braces
corrupted.

I have versions of uudecode in Turbo Pascal, C, and Microsoft BASIC. This
is pretty wide availability, and the BASIC version, while slow, should be
easily adaptable to most machines. I don't think there will be too many
problems getting a version. Just ask in net.sources.wanted.

The problem with having to use the editor has nothing to do with
uuencode/uudecode. The problem is that some news software running on some
machines on the net truncates files longer than 64K bytes. Unless the mail
software is changed, you will always need to get rid of the mail
header/signature information put in by mail, as this is ASCII too. You will
probably have to do this with an editor. I failed to find this task difficult.

I don't know about other sites, but 99% of the problems we have here are
files which are corrupted in transit, either because the sender posted
files larger than 64K and they got truncated or some of the information
was just corrupted in transit. A new scheme is not going to fix this. 

I vote that we stick with uuencode/uudecode. If you have problems with these,
I am sure someone on the net will be glad to help you get them worked out.
I receive files all the time that have been arced and then uuencoded and
am able to reverse the process without problems.


-- 
Ken Thompson  Phone : (404) 894-7089
Georgia Tech Research Institute
Georgia Institute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!thomps

ken@argus.UUCP (Kenneth Ng) (10/25/86)

In article <2035@dalcs.UUCP>, silvert@dalcs.UUCP (Bill Silvert) writes:
> Here are some opinions about the problem of posting binaries, along with
> a draft solution.  There should be some discussion on the net before it
> gets implemented.
> 
> Binaries have to be encoded as ASCII files.  But there is no reason why
> we have to use uuencode!  There are evidently problems with it, and we
> should feel free to invent an alternate encoding method which avoids the
> problems with uuencode.  These problems, aside from the minor one that
> uuencode is designed for the Unix environment, are that some characters
> (such as curly braces {}) do not make it through all nodes unscathed
> (IBM machines and others with EBCDIC codes appear to be the culprits),
> and for long files the postings have to be combined in an editor.

The problem is square brackets, which do not exist in EBCDIC in any
standard form.  This becomes a real pain even with source programs
written in C and Pascal.
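
One way to see the point is to compare two EBCDIC code pages.  The
Python sketch below uses the standard cp037 and cp500 codecs as
examples; the choice of these two code pages and the helper name
ebcdic_positions are assumptions made for illustration only.

```python
def ebcdic_positions(ch, codecs=("cp037", "cp500")):
    """Return the hex code of one character in each named EBCDIC code page."""
    return {name: ch.encode(name).hex() for name in codecs}


if __name__ == "__main__":
    # Letters sit at the same position in both code pages; brackets do not.
    for ch in "A[]":
        print(ch, ebcdic_positions(ch))
```

A gateway that assumes one code page while its neighbour uses the other
will turn a bracket into some other character.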

-- 
Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey  07102
uucp !ihnp4!allegra!bellcore!argus!ken
     ***   WARNING:  NOT ken@bellcore.uucp ***
     !psuvax1!cmcl2!ciap!andromeda!argus!ken
bitnet(preferred) ken@orion.bitnet

McCoy: "This won't hurt a bit"
Chekov: "That's what you said last time"
McCoy: "Did it?"
Chekov: "Yes"

pes@bath63.UUCP (Paul Smee) (10/31/86)

I'd agree with Ken Thompson.  uuencode may not be great, but it is usable --
sometimes with a little thought using inspiration derivable from the
uuencode doc.  And, it is more-or-less standardly available on Unix systems.
If we invent a new encode/decode technique, I foresee continual requests from
new people joining the net for yet-another retransmission of fredcode -- or
whatever it's called.

From this standpoint, even hex encoding is optimal, as it is obvious when
you see a hex file what you should do to unpack it.  I fear new baroque
and clever coding techniques will cause more confusion than they solve.
(Unless, of course, you can manage to get the new stuff packaged in as a
standard bit of Unix systems.)

(Also, a quick comment on the original version of the proposal --
The character set A-P is only contiguous on ASCII machines.  Fine for
my purposes, but not that handy for EBCDIC users.  Of course, I've been
known to argue that EBCDIC users deserve whatever they get, but I'm
prepared to accept that others might disagree; and besides, it's not clear
there *is* any nice subset of chars in EBCDIC.)
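
The gap in the letters can be checked with a short Python sketch that
applies value = char - 'A' to the EBCDIC codes of A through P.  The
cp037 codec stands in for EBCDIC here, and the function name
ebcdic_offsets is made up for illustration.

```python
def ebcdic_offsets(letters="ABCDEFGHIJKLMNOP", codec="cp037"):
    """Return code(letter) - code('A') for each letter in an EBCDIC code page."""
    base = "A".encode(codec)[0]
    return [ch.encode(codec)[0] - base for ch in letters]


if __name__ == "__main__":
    # On an ASCII machine this would be 0..15; in EBCDIC there is a jump
    # between I and J, so the simple subtraction no longer yields a hex digit.
    print(ebcdic_offsets())
```

A decoder running on such a machine would need a lookup table instead
of the subtraction.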