jac@paul.rutgers.edu (Jonathan A. Chandross) (09/18/90)
Submitted-by: jac
Posting-number: Volume 1, Administrivia: 13

A complete treatise on beginning of line sentinels and BITNET chomping.

--------------------------------------------------------------------------------

From: jms@tardis.tymnet.com (Joe Smith)

> it is better if such things insert a single character at the beginning
> of each line (such as ">", " ", or "=").
> [ I don't know what the selector buys you. (stuff on quoted formfeeds)
> The ">" vs " " distinction is not necessary here, unlike with shar.

If you don't put an extra character in front of every line, you will find
that some of the sources won't work, depending on which systems the
posting was forwarded through.

Newsgroups such as this one are often forwarded to people who don't have
access to a full news feed. Many mailers will slightly mangle the posting
during delivery. Some known problems are:

  1) Removal of all trailing blanks on a line
  2) Padding every line with blanks until it is 80 columns wide
  3) Translating 2 consecutive linefeeds to <lf><space><lf>
  4) Translating <lf><space><lf> to <lf><lf>
  5) Translating "From " at start of line to ">From ".
  6) Aborting on any line with a period in column 1.

The last four problems can be avoided by prefixing every line of the
source with an agreed upon character.

Considering that sources from this newsgroup will be going through Unix
mailers and BITNET gateways, the use of a prefix character is not just an
option; it is mandatory.

Don't take my word for it - talk to the moderators of the other source
groups and ask them what kind of mangling can happen to posted sources,
and what methods work best. (CP/M-80 did not have UNSHAR, and ran into
these same problems in the days of net.micro.)
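
As a rough illustration of the prefixing idea, here is a minimal Python
sketch. The names add_prefix and strip_prefix, and the choice of "-" as
the agreed upon character, are assumptions made for illustration; they do
not come from any posted tool.

```python
PREFIX = "-"  # assumed sentinel; any agreed upon character would do


def add_prefix(text, prefix=PREFIX):
    """Put the sentinel in front of every line, including empty ones."""
    return "\n".join(prefix + line for line in text.split("\n"))


def strip_prefix(text, prefix=PREFIX):
    """Remove one leading sentinel from every line that carries it."""
    out = []
    for line in text.split("\n"):
        out.append(line[len(prefix):] if line.startswith(prefix) else line)
    return "\n".join(out)


# A source with a leading "From ", an empty line, and a period in column 1.
source = "From here on\n\n.LP\nmain() { }"
wrapped = add_prefix(source)

# After prefixing, no line is empty, starts with "From ", or starts with
# ".", so problems 3 through 6 have nothing to trigger on.
safe = all(
    line and not line.startswith("From ") and not line.startswith(".")
    for line in wrapped.split("\n")
)

# Stripping the sentinel gives back the original text.
round_trip = strip_prefix(wrapped) == source
```

Note that the prefix does nothing for problems 1 and 2, which change the
end of a line rather than its start.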
--------------------------------------------------------------------------------

From: Doug Gwyn (VLD/VMB) <gwyn@brl.mil>

I believe that it is the job of the mail exchange facilities to take care
of differences in host text file formats, which is what items 1, 2, 3,
and 4 are due to. In fact there are even more variations than those
mentioned. We had to consider such stuff when drafting X3.159.

Item 5 is caused by an inferior UNIX mailer that can be tracked back to at
least Sixth Edition UNIX (circa 1975). That mailer did not have any means
of distinguishing between headers and message bodies, which it stupidly
kept in the same mailbox file, so it simply assumed that a line starting
with "From" started a header. It didn't take long for people to notice
that the mail reader misinterpreted some messages due to this assumption
not being quite perfect, so the mailer was hacked to stick a ">" in front
of mailed text lines that started with "From". However, this is not an
unambiguously invertible mapping, and in fact the mail reader was not
simultaneously modified to even attempt to undo the mapping. This was
never a correct solution for intersystem exchange of mail, for which the
simplistic rule about the structure of headers is seldom correct, and
shouldn't be encountered on systems that are capable of receiving Usenet
news or mail from alien systems, which is what we are talking about for
these archives.

Item 6 is due to the lack of a correct mapping between the sent message
and the form in which it appears at a "mail server port". There are
numerous such warts known among various mailers, including problems with
lines starting with ~ characters. Problems occur when mailers send the
message directly to a mail server port without regard for the fact that
special characters in the message body could cause inadvertent control
actions; a correct solution to in-band control requires mapping on both
ends.
As an illustration of the fact that this can be done, I have no problem at
all sending and receiving such messages via our MMDF facilities, which
conform to SMTP.

The bottom line is that misinterpretation of message characters is a
generic concern of certain mailers, so the mailers should take care of
their own self-imposed problems, not impose constraints on the content of
messages that can be reliably sent through those facilities. If the
providers of the mail software fail to do their job properly, as seems to
often be the case, then whatever means are used to gateway between Usenet
newsgroups and sites using inferior mailers should apply the needed
additional "wrapper" protection to circumvent the problems. As a matter
of proper design, this should not be left up to individual messages or
newsgroups, but should be addressed on a global basis.

As far as simply different text file formats are concerned, not much can
be done about that other than to realize that it can happen. The native
text-file formats will be used at some point during the transfer,
unpacking, and use of text archives among systems. Thus, significant
spaces at the ends of lines should be avoided; this is not usually a
problem for source code in traditional programming languages nor for
documentation. Most "control characters" should also be avoided, since
not all hosts will have ways to represent all control actions in
conjunction with text storage. One might argue that a solution would be
to require all host formats and support for text files to conform to
relevant standards, but there is zero chance of that ever happening.

Posted source archives will be generally much more useable if their
posted form is as close to "virgin" as possible. As a matter of fact, I
don't expect mailer warts to cause much trouble in our case anyway, since
VERY few Apple II sources would have lines starting with ., ~, etc.
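
To make the point about in-band control concrete, the following Python
sketch contrasts two mappings. The first follows the SMTP convention for
a period in column 1: the sender adds a period to any line that starts
with one, and the receiver removes one, so the original text is recovered
exactly. The second is the one-sided ">From" hack, which cannot be undone
without guessing. All function names are hypothetical, and the "From "
test (with a trailing space) is an assumption taken from item 5 of the
list earlier in this posting.

```python
def dot_stuff(lines):
    """Sender side: add a period to any line that starts with one."""
    return ["." + line if line.startswith(".") else line for line in lines]


def dot_unstuff(lines):
    """Receiver side: drop the first period of any line that starts with one."""
    return [line[1:] if line.startswith(".") else line for line in lines]


def from_escape(lines):
    """The one-sided hack: put ">" in front of lines starting with "From "."""
    return [">" + line if line.startswith("From ") else line for line in lines]


# Mapping on both ends: every body survives the round trip unchanged.
body = [".", "..", "plain text", ".profile"]
restored = dot_unstuff(dot_stuff(body))

# Mapping on one end only: two different bodies produce the same output,
# so no receiver can tell which one was sent.
first = ["From the top"]
second = [">From the top"]
collide = from_escape(first) == from_escape(second)
```

Here restored equals body, while collide is true: the period mapping has
an inverse and the ">From" mapping does not.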
--------------------------------------------------------------------------------

From: Doug Gwyn (VLD/VMB) <gwyn@brl.mil>

I personally think there is no excuse for BITNET mailers to fail to
conform to the official mail protocols that provide for transparent
delivery of arbitrary text messages.

The problem with adding still more "escapes" in the archiving scheme is
that there is no end to them. Will ! characters have to be escaped in the
middle of lines? How can we possibly guess whether or not that will make
somebody's mailer sick? It is possible to assume a "lowest common
denominator" and map everything into that form, but then the files end up
looking like "uuencoded" data, which is a severe drawback. Note that even
ASCII HT (tab) characters can be problematic for some systems, yet any
standard-conforming implementation of C must allow them in C source code.
Note also that ~ and ^ characters are commonly found in C and Pascal
source code, while $ is found in AppleSoft BASIC source code. Yet $ is
not available on all systems, and who knows to what it might get mapped?
Are we supposed to escape $ also? This is obviously a bottomless pit.

Why is BITNET entitled to special consideration for problems that are
merely due to poor implementation of mail gatewaying? If they connect to
the Internet, aren't they obliged to conform to Internet standards at the
interface? Perhaps it is time to make a stand, and tell BITNET sites that
we are tired of kludging around a problem that we have every right to
expect BITNET administration to have solved long ago.

--------------------------------------------------------------------------------

[ Some overall comments.

  I end up with files butchered by mailers; leading periods are doubled
  and a leading "From" becomes ">From". I've always regarded this as a
  disaster. Adding a sentinel does fix the problem, although the right
  way to solve the problem is to fix the mailers suffering from brain
  damage.
  Anyway, I am willing to adopt a sentinel at the beginning of lines:

	file-name
	-line_1
	-...
	-line_n

  For example,

	test.c
	-main()
	-{
	-...
	-}

  I think this solves the problem and is not too ugly. ]

--------------------------------------------------------------------------------

To get something posted to comp.sources.apple2, send it to:

	Internet: jac@paul.rutgers.edu
	UUCP:     rutgers!paul.rutgers.edu!jac

Please mark comments that are not to be posted with "Not for Posting".
Otherwise, I often can't tell.

Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac
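
For reference, unpacking a posting in the sentinel format proposed above
could look like the following Python sketch. It assumes that a non-blank
line not starting with "-" names the next file, and that every line
starting with "-" belongs to the most recently named file. The posting
does not spell out these details, and the name unpack is hypothetical.

```python
def unpack(posting):
    """Return {file_name: contents} for a sentinel-formatted posting."""
    files = {}
    current = None  # name of the file currently being collected
    for line in posting.split("\n"):
        if line.startswith("-"):
            # Sentinel line: drop the "-" and keep the rest as file content.
            if current is not None:
                files[current].append(line[1:])
        elif line.strip():
            # Any other non-blank line is taken to be a file name (assumed).
            current = line.strip()
            files[current] = []
    # Join each file's lines, ending every line with a newline.
    return {
        name: "".join(text + "\n" for text in body)
        for name, body in files.items()
    }


posting = "test.c\n-main()\n-{\n-}\n"
result = unpack(posting)
```

With this input, result maps "test.c" to the three source lines with the
sentinels removed.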