[comp.mail.misc] Tab Expansion in E-mail

toddp@hp-ptp.HP.COM (Todd_Poynor) (02/28/90)

Message bodies containing the Horizontal Tab character (ASCII 9) pose quite
a problem to Mail User Agents: it is impossible to know how to correctly
reproduce the original behavior of the tab on a recipient's display
device.  That is, the tab stops defined at the sending user's terminal may
not correspond to the tab stops defined at the terminal of the destination
user(s).  Although tab stops at every 8 character positions are fairly
standard on UNIX systems, this is not the case for other kinds of hosts that
send and receive Internet mail.  Indeed, on certain hosts and display devices
the tab character is not normally understood as a horizontal tab at all.
Misaligned columns on reports, often to the point of
near-incomprehensibility, are a constant annoyance to users in such a
situation.
  
Two obvious solutions present themselves: avoid using the character in text
messages that may be received by someone with differing tab stops, or
include information with the message that tells the destination User Agent
what the intended tab stops are.
  
Avoidance can be accomplished by simply not pressing the Tab key when
entering messages, but we creatures of habit usually find this hard to
remember.  Messages can be filtered through a process which locally
expands tabs to blanks before dispatching the message, but again, it is
difficult to remember to do this unless it is done automatically.  If the
filtering is performed automatically, it has the undesirable effect of
corrupting certain verbatim-text usages, such as within messages containing
files to be transferred "as is".  A familiar UNIX example of such
corruption is the expansion of tabs within messages containing "shar"
archived files, where the receiving process may detect that the received data 
does not match the data originally sent. 
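
As a rough illustration of why such filtering is detectable, the sketch
below expands tabs with Python's built-in str.expandtabs and compares the
result with the original text.  The sample text and the idea that the
receiver checks a character count are assumptions made for the example;
they are not taken from any particular "shar" implementation.

```python
def detab(text: str, tabsize: int = 8) -> str:
    """Expand every tab to blanks, as a sending-side filter might."""
    return text.expandtabs(tabsize)


# Hypothetical file contents mailed "as is".
original = "name\tsize\n\tverbatim line\n"
received = detab(original)

# The two texts look alike on an 8-column terminal, but they are no
# longer the same data: the character count has changed.
print(len(original), len(received))
print(original == received)
```

Any check on the receiving side that depends on the exact characters sent,
such as a length or checksum comparison, would fail on the detabbed copy.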
  
Automatic expansion of tabs may be feasible if some means of preventing
unwanted expansion is provided.  For the unusual case of mailing verbatim
text it is perhaps not overly difficult to remember to include some sort
of header information or text marker which inhibits message body modification.
Taking a cue from the privacy enhancement RFCs, a text marker such as:
  
      -----TEXT PROTECTION BOUNDARY-----
  
could indicate the end of text subject to detabbing or any other conceivable
text modifications.  This marker may have to be recognized even within
encapsulated messages (messages within messages, as per RFC 934), where a
"- " would be prefixed to the marker.  The marker could even be automatically
generated by archiving software at the top of the archive.
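
A minimal sketch of a sending-side filter that honors such a marker might
look like the following.  The function name, the fixed 8-column expansion,
and the choice to match the marker only when it stands alone on a line are
assumptions made for illustration; the text above does not specify them.

```python
MARKER = "-----TEXT PROTECTION BOUNDARY-----"


def detab_until_marker(body: str, tabsize: int = 8) -> str:
    """Expand tabs only in the text that precedes the protection marker."""
    out = []
    protected = False
    for line in body.splitlines(keepends=True):
        bare = line.rstrip("\r\n")
        # Also accept the "- " prefixed form that RFC 934 encapsulation
        # produces for lines beginning with a hyphen.
        if not protected and bare in (MARKER, "- " + MARKER):
            protected = True
        out.append(line if protected else line.expandtabs(tabsize))
    return "".join(out)


message = "col1\tcol2\n" + MARKER + "\nverbatim\tdata\n"
print(detab_until_marker(message))
```

Everything from the marker onward, including the marker line itself, is
passed through untouched.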
  
Aside from the general impression of inelegance left by the text marker
solution on many computer literates (including this author), the practice
of automatic text modifications strikes some as a gross violation of
data communications protocol.  Although it can be argued that such tab
expansion falls under the category of approved cross-host translations along
with local character set translation, the general feeling is that the
original content of the message should be preserved to the greatest extent
possible.
  
For this reason, a preferred method might be to preserve the tab characters
within the message, and include information in the message header which
informs User Agents what the proper tab settings are.  This information
would normally correspond to the tab stops which were set at the sending
user's terminal.  For mail generated automatically, where no terminal can
be identified with the creation of the message, either a local default may
be given or the information may be omitted, indicating that, as in the
present-day situation, the interpretation of tabs is left to the
destination.
  
A new header field is probably in order for this purpose.  In the absence of
a standard, the user-defined field nomenclature of prefixing the field
name with "X-" has been suggested for prototype implementations.  The
proposed field syntax in RFC 822 notation is:
  
      tab-define   =  "X-Tab-Stops" ":" 1#(tab-posn / tab-incr)
      tab-incr     =  "+" 1*DIGIT
      tab-posn     =  1*DIGIT
  
This syntax defines a field named "X-Tab-Stops" which takes as an argument
a comma-separated list of numerical values, each optionally preceded with a
plus sign.  The list should include at least one of these values, and each
value is a string of at least one decimal digit.  Each of these values is
interpreted as the definition of the next tab stop in left-to-right order
across the destination display.  The interpretation of each value is as
follows:
  
      o   If it is a tab-posn (that is, is not preceded by a plus sign)
          the value is the character position of the next tab stop,
          where the first character in the line is numbered one.
  
      o   If it is a tab-incr (preceded with a plus sign) the value is
          a number of characters relative to the character position of
          the preceding tab stop at which the next tab stop is to be set.
          If there has been no previous tab stop definition, meaning that
          this is the first item in the list, the increment is relative
          to character position 1.  If it is the last item in the list
          this increment applies indefinitely, such that the effect is to
          have an infinite number of tab stops set from this position
          forward, each with this same character position increment
          between them.
  
So for UNIX users with tab stops every 8 characters this might appear as:
  
      X-Tab-Stops: 9, 17, 25, 33, 41, 49, 57, 65, 73
  
or more succinctly:
  
      X-Tab-Stops: +8
  
A typical setting for FORTRAN programmers might be:
  
      X-Tab-Stops: 7, +3
  
which sets the first tab at position 7, the start of the statement area,
and every 3 positions thereafter.
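
The interpretation rules and examples above can be sketched as a small
parser.  The function names are hypothetical, and several points are
assumptions because the proposal leaves them open: RFC 822 comments and
header folding are ignored, empty list elements are skipped, and a zero
increment or a stop that does not lie to the right of the previous stop is
rejected.

```python
import re
from typing import List, Optional, Tuple

_ITEM = re.compile(r"^(\+?)([0-9]+)$")


def parse_tab_stops(value: str) -> Tuple[List[int], Optional[int]]:
    """Parse an X-Tab-Stops field body.

    Returns the explicit 1-based stop columns and, if the last item is a
    tab-incr, the increment that repeats indefinitely (otherwise None).
    """
    items = [item.strip() for item in value.split(",") if item.strip()]
    if not items:
        raise ValueError("at least one tab stop is required")
    stops: List[int] = []
    repeat: Optional[int] = None
    previous = 1  # a leading increment is relative to column 1
    for index, item in enumerate(items):
        match = _ITEM.match(item)
        if match is None:
            raise ValueError("bad tab stop item: %r" % item)
        number = int(match.group(2))
        if match.group(1) == "+":
            if number == 0:
                raise ValueError("zero increment")  # assumption
            position = previous + number
            if index == len(items) - 1:
                repeat = number  # the last increment applies indefinitely
        else:
            position = number
        if stops and position <= stops[-1]:
            raise ValueError("tab stops must increase")  # assumption
        stops.append(position)
        previous = position
    return stops, repeat


def stops_up_to(value: str, width: int) -> List[int]:
    """List every stop column that falls within a display of given width."""
    stops, repeat = parse_tab_stops(value)
    result = [s for s in stops if s <= width]
    if repeat is not None:
        column = stops[-1] + repeat
        while column <= width:
            result.append(column)
            column += repeat
    return result


print(stops_up_to("+8", 73))     # same stops as the explicit UNIX list
print(stops_up_to("7, +3", 20))  # FORTRAN example
```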
  
One possible modification to the argument syntax is to delete the commas,
using blanks as separators between items for brevity (RFC 822 favors
the comma-separated syntax shown above for lists).
  
This header field is intended to be interpreted by Mail User Agents at
message viewing time.  Tab characters in the body are expanded as blanks,
according to the tab stops defined in the field.  Of course, use of other
control characters or characters outside the standard printing subset may
cause the User Agent to have an incorrect notion of the current character
position at expansion time.  This is not expected to be a problem in most
text messages of the sort normally used in inter-host environments.
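
A sketch of the viewing-time expansion follows.  It takes the stops as an
already parsed list of 1-based columns plus an optional repeating increment
(the meaning of a trailing "+n" item).  The function name is hypothetical,
every non-tab character is assumed to occupy exactly one column, and a tab
found beyond the last defined stop is replaced by a single blank; the
proposal does not say what should happen in that last case.

```python
from typing import Optional, Sequence


def expand_tabs(text: str, stops: Sequence[int],
                repeat: Optional[int] = None) -> str:
    """Replace each tab with blanks up to the next tab stop."""
    out = []
    column = 1  # 1-based column where the next character will land
    for ch in text:
        if ch == "\n":
            out.append(ch)
            column = 1
            continue
        if ch != "\t":
            out.append(ch)
            column += 1
            continue
        # First explicit stop strictly to the right of the current column.
        target = next((s for s in stops if s > column), None)
        if target is None and repeat and stops:
            # Continue the series last, last+repeat, last+2*repeat, ...
            last = stops[-1]
            steps = (column - last) // repeat + 1
            target = last + steps * repeat
        if target is None:
            target = column + 1  # assumption: no stop left, one blank
        out.append(" " * (target - column))
        column = target
    return "".join(out)


# "X-Tab-Stops: +8" corresponds to stops=[9], repeat=8.
print(expand_tabs("a\tb\tc", [9], 8))
# "X-Tab-Stops: 7, +3" corresponds to stops=[7, 10], repeat=3.
print(expand_tabs("\tX = 1", [7, 10], 3))
```

As the paragraph above notes, backspaces or other control characters in
the body would throw off the column count kept by such a routine.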
  
This solution addresses the problem of reading tabbed messages at the
presentation level, which many feel is appropriate.  Not specifically
addressed is the problem of saving the message text in the local file
system, where detabbing a particular message may be required or may be
prohibited, depending on the intended use of the file.  Conceivably,
the same software which displays messages on terminals can perform the
conversion into files, leaving execution of this software for file storage
to user discretion.
  
Digests may require interpretation of the "X-Tab-Stops" field at each
encapsulated message header by presentation software, or digestification
software may convert encapsulated messages to a common tabbing convention.
  
A subject of controversy is whether gateways to foreign mail systems not
adhering to any such tab stop representation should expand the tabs
contained in the body according to the "X-Tab-Stops" field during transfer.
Bearing the aforementioned warnings about data corruption in mind, this
author recommends the data be passed unretouched; let the foreign
community demand such a capability in that mailer if deemed of sufficient
importance.
  
Obviously, the problem of tab representation is difficult to solve, perhaps
more difficult than is warranted by the relatively minor consequence
involved.  If you have any thoughts on a simpler solution I welcome your
suggestions.
  
  
Submitted for your approval,
  
Todd Poynor    HP Data Systems Operation     todd@hpepoc.hp.com  408/746-5185

Craig_Everhart@transarc.com (03/01/90)

It seems a shame to burn (human) cycles on a minor problem when there
are so many larger fish to fry.  I would refer the reader to RFC 1049
for the description of a Content-Type: header in messages that reaches
far beyond simple tab-stop specification.

I've gotten reasonably used to viewing all messages in a variable-pitch
font, switching to a fixed-pitch one only when I want to see the
ASCII-graphic information that somebody has written.  Yes, my mail
reader assumes it knows what tab characters mean, but I could teach it
other things by extending content-type rather than by inventing yet
another header.

Todd Poynor has done an effective job of analyzing the problems that
would arise in practice: encapsulation, shar'ing, gatewaying, and the
like.  I just wish he were looking at the Content-Type: problems instead
of a simple tab-stop one!

		Craig

Makey@Logicon.COM (Jeff Makey) (03/01/90)

In article <1960003@hp-ptp.HP.COM> toddp@hp-ptp.HP.COM (Todd_Poynor) writes:
>If you have any thoughts on a simpler solution I welcome your suggestions.

People (not autonomous programs acting on behalf of people!) could
convert tabs to the appropriate number of spaces before they send
their mail.  I do this and it works wonderfully.

                           :: Jeff Makey

Department of Tautological Pleonasms and Superfluous Redundancies Department
    Disclaimer: Logicon doesn't even know we're running news.
    Internet: Makey@Logicon.COM    UUCP: {nosc,ucsd}!logicon.com!Makey

toddp@hp-ptp.HP.COM (Todd_Poynor) (03/07/90)

From: Craig_Everhart@transarc.com
  
>It seems a shame to burn (human) cycles on a minor problem when there
>are so many larger fish to fry.  I would refer the reader to RFC 1049
>for the description of a Content-Type: header in messages that reaches
>far beyond simple tab-stop specification.
  
RFC 1049 Content-Type syntax could indeed be used to specify tab
presentation, since the resource-ref allows a local-part to be given, which
in turn allows a quoted-string to be given.  I like this suggestion.  At
first I considered Content-Type inappropriate since it appeared to be
concerned only with "larger fish": the content-type identifiers mentioned in
the RFC are supposedly all that is needed to specify the desired appearance,
that is, the identifiers name standard formats for which the interpretation
should be clear.  The syntax is not really geared toward supplying a more
complex set of rules as required to interpret non-standardized "contents".
  
If the Content-Type field is to be used in this manner, then I suspect that
presentation software will need to handle more than one of these fields in a 
message header, such that one can specify the behavior of tabs within the
larger context of a document format, for example.  Failing this, we require
the screen appearance to be completely defined by a single
content-type/ver-num/resource-ref tuple, probably relegating the tab-stop
definition to an optional part of the resource-ref of a content-type named
"PLAIN-TEXT".  Perhaps this is sufficient for the tab-stop problem, but I
won't be surprised if instances arise where the contents of a message
require interpretation at both the text and overall organization levels.
  
And so on to the question of, "Are we analyzing to death a piddling little
detail when more important issues abound, like mailing POSTSCRIPT files?"
The final paragraph of my original posting anticipated such a viewpoint, and
I freely admit that the accusation has merit.  Part of my motivation to
discuss this problem was to see if concern over mismatched tab stops
would strike a responsive chord in the USENET community.  Such a reaction
is lacking thus far.  The issue is far closer to my heart than those
specifically tackled in RFC 1049 since I work on a computer where tab stops
are entirely a matter of user preference, and deal with the e-mail problem
regularly.  The issue has been brought up a number of times by various people
on Hewlett-Packard's private newsgroups due to the common use here of
systems to which the tab character is almost completely foreign.  I can hardly
believe such a situation is all that rare.  Does anyone else believe this
problem is worth discussing?
  
  
From: Makey@Logicon.COM (Jeff Makey)
  
>People (not autonomous programs acting on behalf of people!) could
>convert tabs to the appropriate number of spaces before they send
>their mail.
  
Not all users of mail systems are aware of the problem, and few of those who
are aware are diligent about either avoiding the tab key or performing such
a conversion.  The basenote mentioned this.  If relying on the user community
never to use tabs is the most feasible solution then many people must learn
to break a deeply ingrained habit.  I do agree, of course, that simple
abstinence is quite a clean solution from a technical standpoint, if not
from a behavioristic one.  The bulk of this discussion has been concerned
with doing the right thing if, alas, a tab character makes it through.
If complete avoidance is desired then the next step is to decide how best
to educate users about the problem, and how to encourage or enforce this
avoidance.
  
^todd