[comp.mail.misc] International character set requirements needed

hansen@pegasus.att.com (Tony L. Hansen) (12/31/90)

<< From: hansen@pegasus.att.com (Tony L. Hansen)
<< By the way, the SMTP protocol doesn't permit 8-bit data. This limits
<< mailers which must send mail using that protocol.

< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
< True.  But there is no technical reason (other than short-sightedness)
< why SMTP has to strip off the 8th (high) bit.  There are in fact
< working versions of sendmail that don't disturb the 8th bit.

I agree completely; there is no reason to limit SMTP to 7 bits.
Unfortunately, the standard currently REQUIRES the stripping, and doing
anything else is non-standard. I would definitely support changing the
standard to allow an arbitrary 8-bit byte stream. This would also require
eliminating the limit of 1000 characters per line and anything else in the
standard which is not content transparent.
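
To make the effect of the stripping concrete, here is a minimal Python
sketch of a relay that clears the high bit of every byte. It is only an
illustration, not code from any SMTP implementation, and the function name
strip_high_bit is hypothetical.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the 8th (high) bit of every byte, as a 7-bit-only relay would."""
    return bytes(b & 0x7F for b in data)


# "Jørn" in ISO 8859-1: the ø is byte 0xF8, which has the high bit set.
original = "Jørn".encode("latin-1")
stripped = strip_high_bit(original)

print(original)   # b'J\xf8rn'
print(stripped)   # b'Jxrn'  (0xF8 & 0x7F == 0x78, ASCII 'x')
```

Pure ASCII text passes through unchanged; any byte above 0x7F is silently
turned into a different, valid-looking ASCII character.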

System V release 4 mail is completely content transparent. As long as the
transport media is capable of handling the mail, SVr4 mail will be able to
get it to you unchanged. Unfortunately, it can't do so over SMTP
connections.

Since this discussion is going somewhat away from the bounds of comp.text,
I've added comp.mail.misc to the Newsgroup list.

					Tony Hansen
				att!pegasus!hansen, attmail!tony
				    hansen@pegasus.att.com

barmar@think.com (Barry Margolin) (12/31/90)

In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>System V release 4 mail is completely content transparent. As long as the
>transport media is capable of handling the mail, SVr4 mail will be able to
>get it to you unchanged.

What does it do when sending textual mail to a system that doesn't use
ASCII encoding, e.g. an IBM mainframe, or to a system with a different
newline convention (e.g. CRLF rather than LF)?  SMTP places restrictions on
the characters that may appear in a message to support automated
translation during the transfer process.
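
The kind of automated translation meant here can be sketched in Python as
follows. This is a hedged illustration only: the function names to_wire and
from_wire are hypothetical, and Python's cp037 codec (one EBCDIC code page)
merely stands in for "an IBM mainframe character set".

```python
def to_wire(local_text: str) -> bytes:
    """Sender side: local LF-terminated text -> 7-bit ASCII with CRLF."""
    return local_text.replace("\n", "\r\n").encode("ascii")


def from_wire(wire: bytes, newline: str, charset: str) -> bytes:
    """Receiver side: canonical form -> local newline and local charset."""
    text = wire.decode("ascii")          # raises on any 8-bit byte
    return text.replace("\r\n", newline).encode(charset)


wire = to_wire("Hello\nworld\n")
unix_copy = from_wire(wire, "\n", "ascii")     # LF newlines, ASCII
ebcdic_copy = from_wire(wire, "\n", "cp037")   # LF newlines, EBCDIC cp037

print(wire)
print(unix_copy)
print(ebcdic_copy.hex())
```

The translation is only safe because both ends know the wire format is
ASCII text; applied to arbitrary binary data, the same steps would corrupt
it.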

--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

keld@login.dkuug.dk (Keld Jørn Simonsen) (01/02/91)

hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>< True.  But there is no technical reason (other than short-sightedness)
>< why SMTP has to strip off the 8th (high) bit.  There are in fact
>< working versions of sendmail that don't disturb the 8th bit.

This introduces a problem with "embedded slashes", which Sendmail
currently represents internally by setting the 8th bit.
Has anybody got Sendmail patches to remedy this?

>I agree completely; there is no reason to limit SMTP to 7 bits.
>Unfortunately, the standard currently REQUIRES the stripping, and doing
>anything else is non-standard. I would definitely support changing the
>standard to allow an arbitrary 8-bit byte stream. This would also require
>eliminating the limit of 1000 characters per line and anything else in the
>standard which is not content transparent.

I am much in favour of extending the character set supported by SMTP.
But you should be careful. What is the meaning of an 8-bit character?

Well, that depends on the character set employed. Today we know that only
7-bit ASCII is allowed. But with 8-bit mail, is the code 162 (hex A2, octal
0242) coming over the line a "cent sign" (as in ISO 8859-1:1987), a "small o
with acute accent" (as in IBM CP 437) or a "capital A with circumflex" (as
in HP Roman8)? Conversely, "small a with acute accent" is code 225 in
ISO 8859-1 but code 160 in IBM CP 437. This might become a real problem
given the current shares of the UNIX market. Just displaying the 8-bit data
to a user may be very confusing. It may even do strange things to your
terminal equipment if an IBM codepage character set is employed, as some of
its characters fall in the upper control character range of ISO 8859 and of
other vendors' character sets.

Should one then just say "Use ISO 8859"? Well, which ISO 8859?
There are several parts: Latin 1, Latin 2 (eastern Europe),
Greek, Cyrillic, Arabic, Hebrew (among others). The abovementioned
code 162 has different meanings in these different character sets.
ISO 8859-1 would be the natural choice (and is also specified in a recent
RFC on the "Encoding:" header). But is that fair? I think that is like
inventing a new ASCII, only capable of serving one region of the world
sufficiently, this time having Western Europe (EEC) and all of
North and South America covered. We should do something that could
cover the whole world.
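
The ambiguity is easy to reproduce. The following Python sketch takes the
single byte 0xA2 (decimal 162) as an example and decodes it under several
character sets using Python's built-in codecs; the codec names (latin-1,
cp437, hp_roman8, iso8859_2, iso8859_5) are Python's own, and the helper
name interpretations is made up for the illustration.

```python
import unicodedata

BYTE = bytes([0xA2])  # decimal 162

# Python codec names; hp_roman8 is Python's codec for HP Roman8.
CHARSETS = ["latin-1", "cp437", "hp_roman8", "iso8859_2", "iso8859_5"]


def interpretations(raw: bytes) -> dict:
    """Map each charset name to the Unicode name of what `raw` decodes to."""
    return {cs: unicodedata.name(raw.decode(cs)) for cs in CHARSETS}


for charset, name in interpretations(BYTE).items():
    print(f"{charset:10} {name}")
```

Nothing in the byte itself says which row of that output is the right one;
the receiver has to be told, or has to guess.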

It is also quite hard to persuade your manufacturers to change their
implementation character set, and even worse for equipment you have
already bought and installed. Some of this may even be running software
with no 8-bit capabilities!

I think it would be nice to be able to support all of these new and
old systems, and I have done an implementation of Sendmail capable
of supporting more than 60 character sets. It currently does not touch
the headers, only the mail body. A character that is not in the
current character set is encoded with a mnemonic
code, for example a' for "small a with acute accent".
Thus even in ASCII you can get the message!
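
The mnemonic fallback can be sketched in a few lines of Python. This is an
assumption-laden illustration, not the code in the patches: the table
MNEMONICS below is a tiny hypothetical one (only the a' entry comes from the
text above), and the function name to_charset is made up.

```python
# Hypothetical mnemonic table; only "a'" is taken from the post itself.
MNEMONICS = {"á": "a'", "é": "e'", "ø": "o/"}


def to_charset(text: str, charset: str = "ascii") -> str:
    """Keep characters the target charset can represent; replace the
    others with a mnemonic, or '?' if the table has no entry."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)          # representable in the target set?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(MNEMONICS.get(ch, "?"))
    return "".join(out)


print(to_charset("Málaga", "ascii"))     # Ma'laga
print(to_charset("Málaga", "latin-1"))   # Málaga
```

Without some escape convention a mnemonic such as a' cannot be told apart
from a literal a followed by an apostrophe, so this sketch is one-way only.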

The sendmail patches are available by anonymous ftp from dkuug.dk:pub/ch.shar
and sm5.64.8+bit.pa (sm.8+bit.pa for 5.61). It's about 100 kB: the Sendmail
patches themselves are under 100 lines, the rest is the character set stuff.
It has been running here at dkuug.dk since Feb 90.

A new ISO standard is on its way: ISO 10646 (which has just been
published as a Draft International Standard (DIS)).
It covers all characters in the world, with very few exceptions,
and the exceptions are planned to be included in a later issue.
Actually Dan Oscarsson and I have been planning (mostly Dan)
to do an SMTP implementation for Sendmail that negotiates 10646 for
transmission, and to write an RFC for this character set negotiation.

Keld Simonsen

les@chinet.chi.il.us (Leslie Mikesell) (01/02/91)

In article <1990Dec31.013538.9473@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <1990Dec31.004055.10335@cbnewsk.att.com> hansen@pegasus.att.com (Tony L. Hansen) writes:
>>System V release 4 mail is completely content transparent. As long as the
>>transport media is capable of handling the mail, SVr4 mail will be able to
>>get it to you unchanged.

>What does it do when sending textual mail to a system that doesn't use
>ASCII encoding, e.g. an IBM mainframe, or to a system with a different
>newline convention (e.g. CRLF rather than LF)?  SMTP places restrictions on
>the characters that may appear in a message to support automated
>translation during the transfer process.

But the automated translation can currently only work with text while
many mailers are now capable of attaching arbitrary binary data to
messages.  Depending on the type of the content, a different transformation
(or none) may be desired.  Assuming that the non-textual portions are
encapsulated with "Content-Type:" and "Content-Length:" headers, it
would be easy for the transport to determine what, if any, transformation
to use.  In addition, an optional "Encoding-Method:" header can allow
temporary transformations to meet the character set requirements of the
transports.  If the sending program had a way to determine the capabilities
of the recipient, encoding could be done on-the-fly, using uuencode or
atob, and thus only done where necessary (but I don't know of anyone
actually doing this yet...).
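
A rough Python sketch of that idea follows. It assumes the sender somehow
knows whether the transport is 8-bit clean (the eight_bit_clean flag), and
it uses the header names from the paragraph above; the header values
("binary", "uuencode") and the function names are assumptions made for the
illustration, not an existing convention.

```python
import binascii


def uuencode(data: bytes, name: str = "part") -> bytes:
    """Encode bytes in uuencode format (45 input bytes per output line)."""
    out = [f"begin 644 {name}\n".encode("ascii")]
    for i in range(0, len(data), 45):
        out.append(binascii.b2a_uu(data[i:i + 45]))
    out.append(binascii.b2a_uu(b""))   # zero-length line ends the data
    out.append(b"end\n")
    return b"".join(out)


def prepare_part(headers: dict, body: bytes, eight_bit_clean: bool):
    """Encode the body only when the transport cannot carry it as-is."""
    headers = dict(headers)            # do not modify the caller's dict
    if not eight_bit_clean and any(b >= 0x80 for b in body):
        body = uuencode(body)
        headers["Encoding-Method"] = "uuencode"   # value is an assumption
    headers["Content-Length"] = str(len(body))
    return headers, body


raw = bytes([0x00, 0xFF, 0x80]) + b"data"
hdrs = {"Content-Type": "binary"}      # value is an assumption
print(prepare_part(hdrs, raw, eight_bit_clean=True))
print(prepare_part(hdrs, raw, eight_bit_clean=False))
```

The sketch only looks at the high bit; a real transport would also have to
consider line lengths and control characters before deciding that a body
can travel untransformed.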

These issues are going to have to be addressed for messages originating
on X.400 systems anyway, so why not try to do it efficiently by adding
the equivalent functionality to SMTP/uucp mailers?

Les Mikesell
  les@chinet.chi.il.us