silvert@dalcs.UUCP (Bill Silvert) (10/18/86)
Here are some opinions about the problem of posting binaries, along
with a draft solution.  There should be some discussion on the net
before it gets implemented.

Sources are no substitute for binaries, since not everyone has the same
compiler, or even language, on micros.

Binaries have to be encoded as ASCII files.  But there is no reason why
we have to use uuencode!  There are evidently problems with it, and we
should feel free to invent an alternate encoding method which avoids
them.  These problems, aside from the minor one that uuencode is
designed for the Unix environment, are that some characters (such as
curly braces {}) do not make it through all nodes unscathed (IBM
machines and others with EBCDIC codes appear to be the culprits), and
that for long files the postings have to be combined in an editor.
Another problem is that uudecode is a complicated program which a lot
of users have trouble getting or rewriting.

I propose that we develop an encoding method for microcomputers that
meets these requirements:

> So simple that users can easily learn the protocol and write their
own version of the decoding program.  Uudecode is relatively easy to
write in C, but gets tricky in languages that do not have low-level bit
operations.

> Moderately compact, to keep the traffic volume down.

> Reasonably good error trapping to check for damaged files.

> Convenient to use, preferably not requiring the use of an editor even
for multi-part postings.

One possibility would be to post hex files, but these are very bulky,
at least twice as long as the binary being posted.  However, a
generalization of posting hex will work -- if we encounter the letter G
in a hex file we know it is an error, but we can also adopt the
convention that the letters G-Z do not have to be encoded, so that they
are represented by one byte in the encoded file instead of two.  This
can save a lot of space.  Based on this, here is my proposal:

*** TO ENCODE A FILE ***

Read through the file a byte at a time, and classify each byte as
follows:

> OK, pass through unchanged
> TRANSFORM to a single byte
> ENCODE as a pair of bytes

The encoding I propose is a modified hex, using the letters A-P instead
of the usual hex 0-9A-F -- the reason for this is that it is trivial to
map this way, e.g., value = char - 'A'.  The rest of the upper case
letters, Q-Z, can be used for error checking and for 1-byte
transformations of common non-graphic bytes, such as NULL and NEWLINE.
Thus the actual encoding rules could be:

> OK includes the digits 0-9, the lower case alphabet, and punctuation
marks.
> TRANSFORM \0 -> Q, \r -> R, space -> S, \t -> T, etc.
> ENCODE all upper case letters and other characters into modified hex
codes, AA to PP.

I have done this encoding on a number of files using a crude set of
programs that I wrote a while back when I didn't have xmodem working on
my net machine and couldn't get uudecode working on my micro -- the
files were generally no larger than uuencoded files, often smaller.

To avoid very long lines, adopt the convention that white space is
ignored, so that you can put in newlines wherever you want (probably
not in the middle of a hex pair, though).

*** TO DECODE A FILE ***

One simply reverses the process.  Read through the file a byte at a
time, and use a switch or a set of ifs to do the following:

> letter A-P?  Read the next byte and output 16*(first-'A') +
(second-'A').
> letter Q-Z?  Output \0, \r, etc., according to the above table.
> anything else?  Output it as it stands.
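To show how little code the decoder needs, here is a rough sketch in C.
Treat it as an illustration, not a reference version -- it is untested,
and since the transform table beyond Q, R, S, and T is left open above,
only those four are implemented.  (As a worked example of the encoding
itself: the byte 'H', hex 48, becomes the pair EI, while lower case 'i'
passes through unchanged, so "Hi" encodes as EIi.)

#include <stdio.h>

/* Next byte of the encoded stream; white space is not significant
 * in this encoding, so it is skipped here. */
static int nextcode(FILE *fp)
{
    int c;
    do
        c = getc(fp);
    while (c == ' ' || c == '\t' || c == '\n' || c == '\r');
    return c;
}

int main(void)
{
    int c, c2;

    while ((c = nextcode(stdin)) != EOF) {
        if (c >= 'A' && c <= 'P') {       /* modified-hex pair */
            c2 = nextcode(stdin);
            if (c2 < 'A' || c2 > 'P') {
                fprintf(stderr, "bad hex pair\n");
                return 1;
            }
            putchar(16 * (c - 'A') + (c2 - 'A'));
        } else switch (c) {               /* 1-byte transforms */
        case 'Q': putchar('\0'); break;
        case 'R': putchar('\r'); break;
        case 'S': putchar(' ');  break;
        case 'T': putchar('\t'); break;   /* \n etc. would go here too */
        default:  putchar(c);    break;   /* OK class passes through */
        }
    }
    return 0;
}

This version makes no attempt at error recovery beyond rejecting a
malformed hex pair; the begin and end flags discussed below would be
rejected as bad pairs by it.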
*** REFINEMENTS ***

I haven't said anything yet about error checking, convenience, etc.
Note that there are several byte combinations that are not used in this
scheme, specifically a letter A-P followed by a letter Q-Z.  These can
be used to add such features.

For example, an encoded file should begin with the pair AZ and end with
PZ, similar to the begin and end lines used by uuencode.  However, we
could also adopt the convention that when a file is broken into parts,
the first part ends with BZ, the next begins with CZ, and so on.  This
way one could simply decode a set of files without first combining them
-- the program would start at the AZ flag, and stop when it found BZ.
Then it would go on to the next file and search for CZ, etc.  If it
didn't find PZ at the end of the last file, or if the codes were out of
order, it would complain.  (A sketch of this check appears at the end
of this article.)

A further refinement would be to add various checksums, set off by
other unused code pairs.  I'll pass on this one -- it sounds like a
good idea, but it adds complication.  Perhaps it could be made
optional, such as writing a checksum after each termination code like
BZ ... PZ.

If this idea seems reasonable, perhaps net moderators could carry the
ball from here.  Unfortunately this site is not very reliable for news
and mail.
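To make the flag-pair convention concrete, here is how the sketch above
might grow to handle multi-part postings.  Again the details are mine,
not part of the proposal proper: the variable names, the reading of A,
C, E, ... as begin flags and B, D, ... as end flags, and the naive scan
for a begin flag between parts, which a stray flag-like pair in a news
header could fool.

#include <stdio.h>

static int nextcode(FILE *fp)         /* next byte, white space skipped */
{
    int c;
    do
        c = getc(fp);
    while (c == ' ' || c == '\t' || c == '\n' || c == '\r');
    return c;
}

static void xfer(int c)               /* 1-byte transforms, else pass through */
{
    switch (c) {
    case 'Q': putchar('\0'); break;
    case 'R': putchar('\r'); break;
    case 'S': putchar(' ');  break;
    case 'T': putchar('\t'); break;
    default:  putchar(c);    break;
    }
}

int main(void)
{
    int c, c2, prev = 0;
    int started = 0;                  /* are we inside a part? */
    int expect = 'A';                 /* flag letter that should come next */

    while ((c = nextcode(stdin)) != EOF) {
        if (!started) {               /* between parts: hunt for begin flag */
            if (prev == expect && c == 'Z') {
                started = 1;
                expect++;
            }
            prev = c;
        } else if (c >= 'A' && c <= 'P') {
            c2 = nextcode(stdin);
            if (c2 == 'Z') {          /* a flag pair ends this part */
                if (c == 'P')
                    return 0;         /* PZ: end of the whole file */
                if (c != expect) {
                    fprintf(stderr, "flag %cZ out of order\n", c);
                    return 1;
                }
                started = 0;          /* hunt for the next begin flag */
                expect = c + 1;
                prev = 0;
            } else if (c2 >= 'A' && c2 <= 'P') {
                putchar(16 * (c - 'A') + (c2 - 'A'));
            } else {
                fprintf(stderr, "bad pair %c%c\n", c, c2);
                return 1;
            }
        } else {
            xfer(c);
        }
    }
    fprintf(stderr, "never saw the final PZ flag\n");
    return 1;
}

With something like this, one could feed the concatenated parts -- news
headers, signatures and all -- straight into the decoder without
touching an editor.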
thomps@gitpyr.gatech.EDU (Ken Thompson) (10/22/86)
In article <2035@dalcs.UUCP>, silvert@dalcs.UUCP (Bill Silvert) writes:
> Here are some opinions about the problem of posting binaries, along
> with a draft solution.  There should be some discussion on the net
> before it gets implemented.
>
> Sources are no substitute for binaries, since not everyone has the
> same compiler, or even language, on micros.
>
> Binaries have to be encoded as ASCII files.  But there is no reason
> why we have to use uuencode!  There are evidently problems with it,
> and we should feel free to invent an alternate encoding method which
> avoids them.  These problems, aside from the minor one that uuencode
> is designed for the Unix environment, are that some characters (such
> as curly braces {}) do not make it through all nodes unscathed (IBM
> machines and others with EBCDIC codes appear to be the culprits), and
> that for long files the postings have to be combined in an editor.
> Another problem is that uudecode is a complicated program which a lot
> of users have trouble getting or rewriting.
>
> I propose that we develop an encoding method for microcomputers that
> meets these requirements:
>
> > So simple that users can easily learn the protocol and write their
> own version of the decoding program.  Uudecode is relatively easy to
> write in C, but gets tricky in languages that do not have low-level
> bit operations.
> > Moderately compact, to keep the traffic volume down.
> > Reasonably good error trapping to check for damaged files.
> > Convenient to use, preferably not requiring the use of an editor
> even for multi-part postings.

I, for one, am strongly opposed to trying to develop a new "standard"
encoding scheme.  For one thing, it will be next to impossible to get
agreement from such a large group on what it should be.  The effort
will cause more problems and confusion than we already have.

I have never experienced the problems with curly braces, though I
suppose it is possible that we never get messages here that have passed
through EBCDIC machines.  Certainly I have never gotten any C source
that had its curly braces corrupted.

I have versions of uudecode in Turbo Pascal, C, and Microsoft BASIC.
That is pretty wide availability, and the BASIC version, while slow,
should be easily adaptable to most machines.  I don't think there will
be too many problems getting a version.  Just ask in
net.sources.wanted.

The problem with having to use an editor has nothing to do with
uuencode/uudecode.  The problem is that some news software running on
some machines on the net truncates files longer than 64K bytes.  And
unless the mail software is changed, you will always need to get rid of
the header and signature information put in by the mailer, since that
is ASCII too.  You will probably have to do this with an editor.  I
have not found this task difficult.

I don't know about other sites, but 99% of the problems we have here
are files which are corrupted in transit, either because the sender
posted files larger than 64K and they got truncated, or because some of
the information was simply garbled along the way.  A new scheme is not
going to fix this.

I vote that we stick with uuencode/uudecode.  If you have problems with
these, I am sure someone on the net will be glad to help you get them
worked out.  I receive files all the time that have been arced and then
uuencoded, and I am able to reverse the process without problems.
--
Ken Thompson                               Phone: (404) 894-7089
Georgia Tech Research Institute
Georgia Institute of Technology, Atlanta, Georgia 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!thomps
ken@argus.UUCP (Kenneth Ng) (10/25/86)
In article <2035@dalcs.UUCP>, silvert@dalcs.UUCP (Bill Silvert) writes:
> Here are some opinions about the problem of posting binaries, along
> with a draft solution.  There should be some discussion on the net
> before it gets implemented.
>
> Binaries have to be encoded as ASCII files.  But there is no reason
> why we have to use uuencode!  There are evidently problems with it,
> and we should feel free to invent an alternate encoding method which
> avoids them.  These problems, aside from the minor one that uuencode
> is designed for the Unix environment, are that some characters (such
> as curly braces {}) do not make it through all nodes unscathed (IBM
> machines and others with EBCDIC codes appear to be the culprits), and
> that for long files the postings have to be combined in an editor.

The problem is square brackets, which do not exist in EBCDIC in any
standard form.  This becomes a real pain even with source programs
written in C and Pascal.

--
Kenneth Ng: Post office: NJIT - CCCC, Newark, New Jersey 07102
uucp:  ...!ihnp4!allegra!bellcore!argus!ken
       *** WARNING: NOT ken@bellcore.uucp ***
       ...!psuvax1!cmcl2!ciap!andromeda!argus!ken
bitnet (preferred): ken@orion.bitnet

McCoy:  "This won't hurt a bit"
Chekov: "That's what you said last time"
McCoy:  "Did it?"
Chekov: "Yes"
pes@bath63.UUCP (Paul Smee) (10/31/86)
I'd agree with Ken Thompson.  uuencode may not be great, but it is
usable -- sometimes with a little thought, using inspiration derived
from the uuencode documentation.  And it is more or less standardly
available on Unix systems.  If we invent a new encode/decode technique,
I foresee continual requests from new people joining the net for
yet another retransmission of fredcode -- or whatever it's called.
From this standpoint even plain hex encoding is optimal, since it is
obvious, when you see a hex file, what you have to do to unpack it.  I
fear that new baroque and clever coding techniques will cause more
confusion than they prevent.  (Unless, of course, you can manage to get
the new stuff packaged in as a standard part of Unix systems.)

(Also, a quick comment on the original version of the proposal: the
character range A-P is contiguous only on ASCII machines.  Fine for my
purposes, but not so handy for EBCDIC users.  Of course, I've been
known to argue that EBCDIC users deserve whatever they get, but I'm
prepared to accept that others might disagree; and besides, it's not
clear there *is* any nice contiguous subset of characters in EBCDIC.)