[comp.sources.apple2] v01_ADM_13: Sentinels and bitnet chomping

jac@paul.rutgers.edu (Jonathan A. Chandross) (09/18/90)

Submitted-by: jac
Posting-number: Volume 1, Administrivia: 13


A complete treatise on beginning-of-line sentinels and BITNET
chomping.

--------------------------------------------------------------------------------

From: jms@tardis.tymnet.com (Joe Smith)

> it is better if such things insert a single character at the beginning
> of each line (such as ">", " ", or "=").
> [ I don't know what the selector buys you.  (stuff on quoted formfeeds)
>  The ">" vs " " distinction is not necessary here, unlike with shar. ]

If you don't put an extra character in front of every line, you will find
that some of the sources won't work, depending on which systems the posting
was forwarded through.  Newsgroups such as this one are often forwarded to
people who don't have access to a full news feed.  Many mailers will slightly
mangle the posting during delivery.  Some known problems are:
	1) Removal of all trailing blanks on a line
	2) Padding every line with blanks until it is 80 columns wide
	3) Translating 2 consecutive linefeeds to <lf><space><lf>
	4) Translating <lf><space><lf> to <lf><lf>
	5) Translating "From " at start of line to ">From "
	6) Aborting on any line with a period in column 1

The last four problems can be avoided by prefixing every line of the source
with an agreed-upon character.  Considering that sources from this newsgroup
will be going through Unix mailers and BITNET gateways, the use of a prefix
character is not just an option; it is mandatory.

Don't take my word for it - talk to the moderators of the other source groups
and ask them what kind of mangling can happen to posted sources, and what
methods work best.  (CP/M-80 did not have UNSHAR, and ran into these same
problems in the days of net.micro.)

--------------------------------------------------------------------------------

From: Doug Gwyn (VLD/VMB) <gwyn@brl.mil>

I believe that it is the job of the mail exchange facilities to take
care of differences in host text file formats, which is what items 1,
2, 3, and 4 are due to.  In fact there are even more variations than
those mentioned.  We had to consider such stuff when drafting X3.159.

Item 5 is caused by an inferior UNIX mailer that can be tracked back
to at least Sixth Edition UNIX (circa 1975).  That mailer did not have
any means of distinguishing between headers and message bodies, which
it stupidly kept in the same mailbox file, so it simply assumed that a
line starting with "From" started a header.  It didn't take long for
people to notice that the mail reader misinterpreted some messages due
to this assumption not being quite perfect, so the mailer was hacked to
stick a ">" in front of mailed text lines that started with "From".
However, this is not an unambiguously invertible mapping, and in fact
the mail reader was not simultaneously modified to even attempt to undo
the mapping.  This was never a correct solution for intersystem exchange
of mail, for which the simplistic rule about the structure of headers is
seldom correct, and shouldn't be encountered on systems that are capable
of receiving Usenet news or mail from alien systems, which is what we
are talking about for these archives.
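
[ A minimal sketch, not part of the original message, of why the ">From"
  mapping described above cannot be reliably undone.  The function names
  are hypothetical, and the naive inverse is an assumption about what a
  reader might attempt; it is not code from any actual mailer. ]

```python
def quote_from(lines):
    """Apply the old mailer hack: put '>' before body lines starting 'From '."""
    return [">" + line if line.startswith("From ") else line for line in lines]


def unquote_from(lines):
    """Naive attempt at an inverse: strip one '>' from lines starting '>From '."""
    return [line[1:] if line.startswith(">From ") else line for line in lines]


# One line the mailer will alter, and one that already looks "quoted".
original = ["From here on, be careful.", ">From the manual:"]
mangled = quote_from(original)
restored = unquote_from(mangled)

# Two different inputs collapse to the same output, so no inverse can
# tell them apart after the fact.
collision = quote_from(["From x"]) == quote_from([">From x"])
```

After the round trip the second line has lost a ">" it legitimately had,
so `restored` differs from `original`.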

Item 6 is due to the lack of a correct mapping between the sent message
and the form in which it appears at a "mail server port".  There are
numerous such warts known among various mailers, including problems
with lines starting with ~ characters.  Problems occur when mailers
send the message directly to a mail server port without regard for the
fact that special characters in the message body could cause inadvertent
control actions; a correct solution to in-band control requires mapping
on both ends.  As an illustration of the fact that this can be done, I
have no problem at all sending and receiving such messages via our MMDF
facilities, which conform to SMTP.
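
[ A minimal sketch, not part of the original message, of "mapping on both
  ends" for the leading-period case.  It follows the SMTP transparency
  rule as I understand it: the sender adds a period to any line that
  begins with one, and the receiver removes one.  The function names are
  hypothetical. ]

```python
def dot_stuff(lines):
    """Sender side: prepend '.' to every line that starts with '.'."""
    return ["." + line if line.startswith(".") else line for line in lines]


def dot_unstuff(lines):
    """Receiver side: drop one leading '.' from every line that starts with '.'."""
    return [line[1:] if line.startswith(".") else line for line in lines]


# A body containing a lone period, which would otherwise end the message.
body = [".PS", "plain text", ".", "..already doubled"]
wire = dot_stuff(body)
received = dot_unstuff(wire)
```

Because every line that starts with a period gains exactly one on the way
out and loses exactly one on the way in, no line on the wire is a lone
period and the body arrives unchanged.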

The bottom line is that misinterpretation of message characters is a
generic concern of certain mailers, so the mailers should take care of
their own self-imposed problems, not impose constraints on the content
of messages that can be reliably sent through those facilities.  If the
providers of the mail software fail to do their job properly, as often
seems to be the case, then whatever means are used to gateway between
Usenet newsgroups and sites using inferior mailers should apply the
needed additional "wrapper" protection to circumvent the problems.  As
a matter of proper design, this should not be left up to individual
messages or newsgroups, but should be addressed on a global basis.

As far as simple differences in text file formats are concerned, not much
can be done about that other than to realize that it can happen.  The
native text-file formats will be used at some point during the transfer,
unpacking, and use of text archives among systems.  Thus, significant
spaces at the ends of lines should be avoided; this is not usually a
problem for source code in traditional programming languages nor for
documentation.  Most "control characters" should also be avoided, since
not all hosts will have ways to represent all control actions in
conjunction with text storage.  One might argue that a solution would be
to require all host formats and support for text files to conform to
relevant standards, but there is zero chance of that ever happening.

Posted source archives will generally be much more usable if their
posted form is as close to "virgin" as possible.  As a matter of fact,
I don't expect mailer warts to cause much trouble in our case anyway,
since VERY few Apple II sources would have lines starting with ., ~,
etc.

--------------------------------------------------------------------------------

From: Doug Gwyn (VLD/VMB) <gwyn@brl.mil>

I personally think there is no excuse for BITNET mailers to fail to
conform to the official mail protocols that provide for transparent
delivery of arbitrary text messages.

The problem with adding still more "escapes" in the archiving scheme
is that there is no end to them.  Will ! characters have to be escaped
in the middle of lines?  How can we possibly guess whether or not that
will make somebody's mailer sick?  It is possible to assume a "lowest
common denominator" and map everything into that form, but then the
files end up looking like "uuencoded" data, which is a severe drawback.
Note that even ASCII HT (tab) characters can be problematic for some
systems, yet any standard-conforming implementation of C must allow
them in C source code.  Note also that ~ and ^ characters are commonly
found in C and Pascal source code, while $ is found in Applesoft BASIC
source code.  Yet $ is not available on all systems, and who knows to
what it might get mapped?  Are we supposed to escape $ also?  This is
obviously a bottomless pit.  Why is BITNET entitled to special
consideration for problems that are merely due to poor implementation
of mail gatewaying?  If they connect to the Internet, aren't they
obliged to conform to Internet standards at the interface?

Perhaps it is time to make a stand, and tell BITNET sites that we are
tired of kludging around a problem that we have every right to expect
BITNET administration to have solved long ago.

--------------------------------------------------------------------------------

[ Some overall comments.  I end up with files butchered by mailers;
  leading periods are doubled and a leading "From" becomes ">From".
  I've always regarded this as a disaster.  Adding a sentinel does
  fix the problem, although the right way to solve the problem is
  to fix the mailers suffering from brain damage.  Anyway, I am
  willing to adopt a sentinel at the beginning of lines:
	file-name
	-line_1
	-...
	-line_n
  For example,
	test.c
	-main()
	-{
	-...
	-}
  I think this solves the problem and is not too ugly.
 ]
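
[ A minimal sketch of packing and unpacking the format above, assuming
  "-" as the sentinel and a bare file name on the first line.  The
  function names and the error handling are hypothetical; this is an
  illustration, not the archiver actually used for this group. ]

```python
SENTINEL = "-"  # assumed prefix character, taken from the example above


def pack(name, lines):
    """Return the file name followed by every line prefixed with the sentinel."""
    return [name] + [SENTINEL + line for line in lines]


def unpack(packed):
    """Recover (name, lines); reject any body line missing the sentinel."""
    name, body = packed[0], packed[1:]
    lines = []
    for number, line in enumerate(body, start=1):
        if not line.startswith(SENTINEL):
            raise ValueError(f"line {number} of {name} lacks the sentinel")
        lines.append(line[1:])
    return name, lines


# Lines that commonly get mangled in transit: a leading "From ", a lone
# period, and an empty line.
source = ["main()", "{", "From here", ".", "", "}"]
packed = pack("test.c", source)
name, restored = unpack(packed)
```

With the sentinel in place no body line is empty, starts with "From ",
or starts with a period, and unpacking restores the original lines.
Trailing blanks are not protected by this scheme.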

--------------------------------------------------------------------------------

To get something posted to comp.sources.apple2, send it to:
	Internet: jac@paul.rutgers.edu
	UUCP: rutgers!paul.rutgers.edu!jac

Please mark comments that are not to be posted with "Not for Posting".
Otherwise, I often can't tell.


Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac