[comp.mail.misc] 8-bit mail

ct@dde.uucp (Claus Tondering) (05/12/89)

More and more companies (especially in Europe) are moving from a
7-bit pseudo-ASCII environment to an 8-bit environment (typically
based on the ISO 8859/1 character set). Our company has been using
this 8-bit character set for some years now. But we have problems with
E-mail. Within our organization uucp transfer of E-mail with 8-bit characters
works fine, but if our mail leaves the organization and goes to this
country's backbone machine, the 8th bit is removed from our letters.

The reason, I am told, is that E-mail is based on a set of RFCs that
specify 7-bit ASCII as the character set to use, and therefore characters
with the 8th bit set are stripped. Why must it be so? Uucp has no problems
with 8-bit characters, so why must we restrict ourselves to a standard
that is dying anyway?
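
To make the effect concrete, here is a minimal Python sketch of what a 7-bit
mail path does to ISO 8859/1 text. The function name strip_eighth_bit is made
up for illustration; it models the bit masking only, not the code of any
particular mailer.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the high-order bit of every byte, as a 7-bit mail path would."""
    return bytes(b & 0x7F for b in data)


# "Tøndering" in ISO 8859/1: the letter ø (U+00F8) is the single byte 0xF8.
original = "T\u00f8ndering".encode("iso8859-1")
mangled = strip_eighth_bit(original)

# 0xF8 & 0x7F == 0x78, which is ASCII "x", so the name arrives as "Txndering".
print(mangled.decode("ascii"))
```

Pure ASCII text passes through unchanged, which is why the damage only shows
up in national characters.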
-- 
Claus Tondering
Dansk Data Elektronik A/S, Herlev, Denmark
E-mail: ct@dde.dk    or    ...!uunet!mcvax!dkuug!dde!ct

prc@erbe.se (Robert Claeson) (05/16/89)

In article <557@Aragorn.dde.uucp> ct@dde.uucp (Claus Tondering) writes:

>More and more companies (especially in Europe) are moving from a
>7-bit pseudo-ASCII environment to an 8-bit environment (typically
>based on the ISO 8859/1 character set). Our company has been using
>this 8-bit character set for some years now. But we have problems with
>E-mail. Within our organization uucp transfer of E-mail with 8-bit characters
>works fine, but if our mail leaves the organization and goes to this
>country's backbone machine, the 8th bit is removed from our letters.

Some do, some don't. Our SysVR3.1 systems with sendmail don't strip the eighth
bit (I know for sure, we always use ISO 8859/1 within our organization), but
most (all?) BSD systems do.

>The reason, I am told, is that E-mail is based on a set of RFCs that
>specify 7-bit ASCII as the character set to use, and therefore characters
>with the 8th bit set are stripped. Why must it be so? Uucp has no problems
>with 8-bit characters, so why must we restrict ourselves to a standard
>that is dying anyway?

RFC is an American kind of "standard". ASCII is an American standard, too.
The Americans only need 7 bits, so don't expect the RFC to change within
this century. I guess we'll have to wait for X.400 to become more widespread
(anyone got a PD X.400 MUA/MTA?).

-- 
          Robert Claeson      E-mail: rclaeson@erbe.se
	  ERBE DATA AB

rja@edison.GE.COM (rja) (05/22/89)

In article <557@Aragorn.dde.uucp> ct@dde.uucp (Claus Tondering) writes:
 
% More and more companies (especially in Europe) are moving from a
% 7-bit pseudo-ASCII environment to an 8-bit environment (typically
% based on the ISO 8859/1 character set). Our company has been using
% this 8-bit character set for some years now. But we have problems with
% E-mail. Within our organization uucp transfer of E-mail with 8-bit characters
% works fine, but if our mail leaves the organization and goes to this
% country's backbone machine, the 8th bit is removed from our letters.

AT&T's UNIX System V.3 and later is 8-bit transparent, but BSD UNIX
uses the high-order bit to store certain characteristics.  I am hopeful
that 4.4 BSD, whenever it comes out, will be 8-bit transparent as well.

The problem will be that there are still a lot of systems out there
based on OSs that mangle the high-order bit and so the transition to
using ISO 8859 across the Internet as a whole will probably be slow.
Nevertheless, it seems clear that the Internet needs to move to support
ISO 8859 character sets.  

In the meantime, if you need an 8-bit character set, try to ensure that
you purchase an OS (like System V.3 & later) that will support it.

storm@texas.dk (Kim F. Storm) (06/08/89)

ct@dde.uucp (Claus Tondering) writes:

>More and more companies (especially in Europe) are moving from a
>7-bit pseudo-ASCII environment to an 8-bit environment (typically
>based on the ISO 8859/1 character set). Our company has been using
>this 8-bit character set for some years now. But we have problems with
>E-mail. Within our organization uucp transfer of E-mail with 8-bit characters
>works fine, but if our mail leaves the organization and goes to this
>country's backbone machine, the 8th bit is removed from our letters.

>The reason, I am told, is that E-mail is based on a set of RFCs that
>specify 7-bit ASCII as the character set to use, and therefore characters
>with the 8th bit set are stripped. Why must it be so? Uucp has no problems
>with 8-bit characters, so why must we restrict ourselves to a standard
>that is dying anyway?

Other people have pointed out the technical reasons for the stripping of the
eighth bit (the Danish backbone is a BSD based system).  It has also been
pointed out that there is little hope that all 8 bit messages will
survive, and that we will have to live with this drawback until X.400
comes around and solves all our problems.

However, at least in Europe where the need for 8-bit character sets is
greater than in the U.S., the Backbones on EUnet should run mailers that
can forward 8-bit data unchanged (not the situation today).  If an end-site
cannot handle 8-bit data it is their problem.

However, even if the backbones and the end-sites would pass on 8-bit data,
we would still be faced with the problem that the messages may be read on
a terminal with a different character set than the one it was posted from.
The poster will have to tell which character set his message is written in,
and the recipient must use a terminal which can show this character set,
or his mail program must be able to remap the characters in a sensible way.

I think that a new header field would be required for this, e.g.
Character-Set: 8859/1 
and if that is missing, ASCII (8859/0 ?) is assumed.  
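
A sketch of how a mail reader could act on such a field, in Python. The
Character-Set: field is only the proposal made here, not an existing standard.
The table CODECS and the function decode_body are hypothetical names, and the
sketch assumes each field value maps directly onto one of Python's built-in
codecs.

```python
# Hypothetical mapping from proposed field values to Python codec names.
CODECS = {
    "8859/1": "iso8859-1",
    "ascii": "ascii",
}


def decode_body(raw_message: bytes) -> str:
    """Decode the message body according to a Character-Set: header field."""
    # Headers and body are separated by the first empty line.
    header_block, _, body = raw_message.partition(b"\n\n")

    charset = "ascii"  # assumed default when the field is missing
    for line in header_block.split(b"\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"character-set":
            charset = value.strip().decode("ascii", errors="replace").lower()

    # Unknown values fall back to ASCII; undecodable bytes become U+FFFD.
    codec = CODECS.get(charset, "ascii")
    return body.decode(codec, errors="replace")


msg = b"From: ct@dde.dk\nCharacter-Set: 8859/1\n\nK\xf8benhavn\n"
print(decode_body(msg))
```

Falling back to ASCII with replacement characters is one possible policy; a
real reader might prefer to show the raw bytes or ask the user.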

Is the concept of character sets defined in X.400 or are we going 
to have the same problem there?

-- 
Kim F. Storm        storm@texas.dk        Tel +45 429 174 00
Texas Instruments, Marielundvej 46E, DK-2730 Herlev, Denmark
	  No news is good news, but nn is better!

recerik@alliant.uni-c.dk (Erik Bertelsen) (06/14/89)

X.400 in the 1984 version supports teletex and the IA5 (ASCII) character
encodings. The 1988 version adds some more options, but not 8-bit encodings
like ISO 8859/1. To my best knowledge X.400 in its current status will not
help us poor souls in countries with alphabets with more letters than the
26 letters used in English. - sigh!

But making mail messages written in ISO 8859/1 (and other flavors of 8859) 
work would be very nice. As Kim suggests we may have to invent yet another
header line to accomplish this - sigh again !

- erik

prc@erbe.se (Robert Claeson) (06/19/89)

In article <RECERIK.89Jun14110610@alliant.uni-c.dk> recerik@alliant.uni-c.dk (Erik Bertelsen) writes:

>X.400 in the 1984 version supports teletex and the IA5 (ASCII) character
>encodings. The 1988 version adds some more options, but not 8-bit encodings
>like ISO 8859/1. To my best knowledge X.400 in its current status will not
>help us poor souls in countries with alphabets with more letters than the
>26 letters used in English. - sigh!

Really? I thought that x.400 did use the iso 6937 8-bit character set (which
is a "true" superset of iso 8859 according to the copy of the standard doc
that I have and thus can be mapped into ascii, the various iso 646 charsets,
the various iso 8859 character sets and I guess a number of related sets as
well).

-- 
          Robert Claeson      E-mail: rclaeson@erbe.se
	  ERBE DATA AB

huitema@mirsa.inria.fr (Christian Huitema) (06/19/89)

From article <733@maxim.erbe.se>, by prc@erbe.se (Robert Claeson):
> In article <RECERIK.89Jun14110610@alliant.uni-c.dk> recerik@alliant.uni-c.dk (Erik Bertelsen) writes:
> 
>>X.400 in the 1984 version supports teletex and the IA5 (ASCII) character
>>encodings. The 1988 version adds some more options, but not 8-bit encodings
>>like ISO 8859/1. To my best knowledge X.400 in its current status will not
>>help us poor souls in countries with alphabets with more letters than the
>>26 letters used in English. - sigh!
> 
> Really? I thought that x.400 did use the iso 6937 8-bit character set (which
> is a "true" superset of iso 8859 according to the copy of the standard doc
> that I have and thus can be mapped into ascii, the various iso 646 charsets,
> the various iso 8859 character sets and I guess a number of related sets as
> well).

ISO 6937 is certainly not supported in X.400-1984. But nothing prevents you
from writing a translator between this alphabet and the TELETEX alphabet T.61,
which is almost a superset of ASCII, and can express all the various forms of
accented letters in 6937 -- plus some more, just to please the Maltese
and the Romanians.

T.61 is only ``almost'' a superset because the status of the nationally
definable code positions of IA5 is very unclear. Using them seems to be
forbidden in plain CCITT T.61, which rules out encoding e.g. braces. There is
an ISO equivalent of T.61, where these characters are defined -- and shall
map to the ``international IA5'', i.e. American...

Christian Huitema

woerz%isaak@isaak.uucp (Dieter Woerz) (06/20/89)

In article <RECERIK.89Jun14110610@alliant.uni-c.dk> recerik@alliant.uni-c.dk (Erik Bertelsen) writes:
>X.400 in the 1984 version supports teletex and the IA5 (ASCII) character
>encodings. The 1988 version adds some more options, but not 8-bit encodings
>like ISO 8859/1. To my best knowledge X.400 in its current status will not
>help us poor souls in countries with alphabets with more letters than the
>26 letters used in English. - sigh!

Why do we have to wait? As far as I know, teletex is capable of sending
all or nearly all characters used in the ISO 8859/1 writing world.
Let your MUA convert the ISO characters into teletex when sending mail
and convert them back to ISO 8859/1 when reading mail.
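
A rough Python sketch of the sending half of such a conversion. Teletex (T.61)
writes an accented letter as a non-spacing diacritic code followed by the base
letter. The code values in NONSPACING are a small subset quoted from memory of
the T.61 / ISO 6937 tables and should be checked against the recommendation;
the function name is hypothetical, and passing ASCII through unchanged is a
simplification, since T.61 is not an exact superset of ASCII.

```python
import unicodedata

# Assumed subset of T.61 non-spacing diacritic codes (verify before use).
NONSPACING = {
    "\u0300": 0xC1,  # grave accent
    "\u0301": 0xC2,  # acute accent
    "\u0302": 0xC3,  # circumflex
    "\u0303": 0xC4,  # tilde
    "\u0308": 0xC8,  # diaeresis
    "\u030a": 0xCA,  # ring above
    "\u0327": 0xCB,  # cedilla
}


def latin1_to_teletex_sketch(data: bytes) -> bytes:
    """Rewrite accented ISO 8859/1 letters as diacritic code + base letter."""
    out = bytearray()
    for ch in data.decode("iso8859-1"):
        # Canonical decomposition splits e.g. "é" into "e" + combining acute.
        parts = unicodedata.normalize("NFD", ch)
        if len(parts) == 2 and parts[1] in NONSPACING and ord(parts[0]) < 0x80:
            out.append(NONSPACING[parts[1]])  # diacritic comes first
            out.append(ord(parts[0]))         # then the base letter
        elif ord(ch) < 0x80:
            out.append(ord(ch))               # ASCII passed through (simplified)
        else:
            out.append(ord("?"))              # not covered by this sketch
    return bytes(out)


print(latin1_to_teletex_sketch("Z\u00fcrich".encode("iso8859-1")))
```

Letters that do not decompose into base plus accent, such as the Danish ø,
have their own teletex codes, which this sketch leaves out and replaces
with "?".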

Dieter Woerz
ISA GmbH, Azenbergstr. 35 D-7000 Stuttgart-1 W-Germany
UUCP:           {pyramid!iaoobel,uunet!unido}!isaak!woerz
BITNET/EARN:    woerz@ds0iff5

gunnar@hafro.is (Gunnar Stefansson) (06/22/89)

From article <759@isaak.UUCP>, by woerz%isaak@isaak.uucp (Dieter Woerz):
> In article <RECERIK.89Jun14110610@alliant.uni-c.dk> recerik@alliant.uni-c.dk (Erik Bertelsen) writes:
>>X.400 in the 1984 version supports teletex and the IA5 (ASCII) character
>>encodings. The 1988 version adds some more options, but not 8-bit encodings
>>like ISO 8859/1. To my best knowledge X.400 in its current status will not
>>help us poor souls in countries with alphabets with more letters than the
>>26 letters used in English. - sigh!
> 
> Why do we have to wait? As far as I know, teletex is capable of sending
> all or nearly all characters used in the ISO 8859/1 writing world.
> Let your MUA convert the ISO characters into teletex when sending mail
> and convert them back to ISO 8859/1 when reading mail.

Actually, the MUA and MTA should be completely transparent in this
regard. That's a heck of a lot easier to accomplish than to change all
MUAs to conform to a given transformation standard. There is really no
good reason why existing mailers should fiddle around with the 8th bit.

If you try to hack things by using encoding features at the MUA level,
you'll just get into trouble later, e.g.  when using the MTA for saving
a piece of mail into a file or stuffing it through a pipe into an
arbitrary program (common in sendmail, for example).

In this country we use elm, smail and sendmail (along with news), which
are fully transparent.  We therefore use ISO 8859/1 exclusively for all
mail (and news). 

Most programs involved need some hacking to accomplish this, due to all
sorts of silly assumptions used by the programs.  Stripping of the 8th
bit is never *needed* at the MUA or MTA levels, but presumably it
is very tempting for a 7-bit ASCII speaker to use the 8th bit for
something else :-)

What we need is for more companies to distribute clean mailers.  Some
already do this (HP, to name one), but others do not (Sun, to name
another).

Gunnar
-- 
-----------------------------------------------------------------------------
Gunnar Stefansson                       Uucp: {mcvax,enea}!hafro!gunnar 
Marine Research Institute		Internet: gunnar@hafro.is
P.O. Box 1390,Reykjavik    		Tel: +354 1 20240 Fax: +354 1 623790

prc@erbe.se (Robert Claeson) (06/25/89)

In article <127@hafro.is> gunnar@hafro.is (Gunnar Stefansson) writes:
>From article <759@isaak.UUCP>, by woerz%isaak@isaak.uucp (Dieter Woerz):

>> Why do we have to wait? As far as I know, teletex is capable of sending
>> all or nearly all characters used in the ISO 8859/1 writing world.
>> Let your MUA convert the ISO characters into teletex when sending mail
>> and convert them back to ISO 8859/1 when reading mail.

>Actually, the MUA and MTA should be completely transparent in this
>regard. That's a heck of a lot easier to accomplish than to change all
>MUAs to conform to a given transformation standard. There is really no
>good reason why existing mailers should fiddle around with the 8th bit.

So how should we then get the messages encoded in the IA5 teletex character
set? Get a teletex terminal? :> No, I want to be able to use my terminal
with the ISO 8859/1 character set, and I want to be able to communicate
with some guy on an IBM mainframe, a PC, a machine with 7-bit ISO 646
terminals (that's ASCII in the U.S.) or even an HP machine with HP's
terminals using the Roman-8 character set -- WITHOUT having to do the
character set translation in my brain (as I have to do now using RFC-
whatever mail). I'd really like to have some machine- or terminal-independent
representation of the messages and have a program convert its contents
into the character set my terminal happens to use. The only question is
whether this translation should be performed by the MTA, the MUA or by some
program which the MUA pipes the message through before sending it to my
terminal. And how should that program get to know what character
set my terminal is using? (I think I just found a use for the SVR3 "CHRCLASS"
environment variable.)
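
The last step could look roughly like this in Python, assuming the message's
character set is already known and that both it and the terminal's set exist
as Python codecs. The function to_terminal is a hypothetical name, and how the
terminal's character set would be discovered (CHRCLASS or otherwise) is left
out of the sketch.

```python
def to_terminal(body: bytes, message_charset: str, terminal_charset: str) -> bytes:
    """Re-encode a message body for the terminal, marking unmappable letters."""
    # Decode into a terminal-independent form first, then encode for display.
    text = body.decode(message_charset, errors="replace")
    return text.encode(terminal_charset, errors="replace")


# "blåbær" as it would arrive in ISO 8859/1.
message = "bl\u00e5b\u00e6r".encode("iso8859-1")

print(to_terminal(message, "iso8859-1", "cp437"))      # IBM PC character set
print(to_terminal(message, "iso8859-1", "hp_roman8"))  # HP Roman-8
print(to_terminal(message, "iso8859-1", "ascii"))      # 7-bit terminal
```

On a 7-bit terminal the two national letters come out as "?", which is crude
but at least does not silently turn them into other letters.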
-- 
          Robert Claeson      E-mail: rclaeson@erbe.se
	  ERBE DATA AB

amanda@intercon.uu.net (Amanda Walker) (06/25/89)

This seems to be the year of the international character set :-).

A large proportion of our customer base is located in western Europe,
and so we've gotten pretty familiar with operating in environments that
are not limited to 7-bit U.S. ASCII.  Unfortunately, there are a lot
of standards to choose from :-), even if you stick to one from the ISO.

If and when mail and news paths become 8-bit transparent (which I think
would be a good idea), the situation will improve, as long as everyone
cooperates.  It sounds like a lot of the European UNIX community has
standardized on ISO 8859/1, which is a step forward from ISO 646 (since
it greatly widens the geographical area served by a single character set),
but it still only puts the problem off for a while, and is only really
useful for most of western Europe.  Eastern Europe, parts of the
Mediterranean, and the Pacific Rim countries are still left high and
dry (to name a few).  They don't have much presence in the global E-mail
networks now, but it will only increase.

Even in western Europe, character sets are still a problem.  There are an
awful lot of people out there still using the DEC Multinational Character
Set, which is similar to but not the same as ISO 8859/1.  There are a lot
of people using National Replacement Character Sets as well, although these
are starting to go away as time goes on.

One of the biggest problems I have in writing code for MUAs and NUAs (News
User Agents :-)) is determining what character set a given message is using.
One thing I would really like to see is for MUA's to start using the
Content-Type: field (or at least X-Content-Type:) in RFC 822 messages.
This way the MUA can have a set of common standards it knows about, and can
translate to whatever the user wants without lots of fancy footwork.  Also,
for some representations, such as ISO 2022 or subsets thereof, you can even
send things transparently through 7-bit channels as long as they don't filter
out the ESC character.
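
A small Python illustration of that last point, using the iso2022_jp codec
that ships with Python as one example of an ISO 2022 subset. The sample text
and the helper filter_esc are made up for the demonstration.

```python
text = "\u65e5\u672c\u8a9e"  # three Japanese characters

# ISO-2022-JP switches character sets with ESC sequences and otherwise
# uses only byte values below 0x80.
encoded = text.encode("iso2022_jp")
print(all(b < 0x80 for b in encoded))   # every byte fits in 7 bits
print(b"\x1b" in encoded)               # the ESC character is present

# A 7-bit channel that clears the 8th bit leaves the message intact.
through_7bit = bytes(b & 0x7F for b in encoded)
print(through_7bit.decode("iso2022_jp") == text)


def filter_esc(data: bytes) -> bytes:
    """Model a channel that removes ESC characters (hypothetical helper)."""
    return data.replace(b"\x1b", b"")


# Without ESC the shift sequences are lost and the text no longer decodes
# to the original characters.
print(filter_esc(encoded).decode("iso2022_jp") == text)
```

The same reasoning applies to any ISO 2022 subset whose shift sequences start
with ESC: the payload survives 7-bit transport, but only if ESC itself does.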

ISO 8859/1 is just the start.  Eventually, I hope the ISO finishes their
multibyte character set standard (10646?), but who knows when that will
happen...

--
Amanda Walker
InterCon Systems Corporation
--
amanda@intercon.uu.net  |  ...!uunet!intercon!amanda

prc@erbe.se (Robert Claeson) (06/29/89)

In article <24-Jun-89.210351@192.41.214.2> amanda@intercon.uu.net (Amanda Walker) writes:

>It sounds like a lot of the European UNIX community has
>standardized on ISO 8859/1, which is a step forward from ISO 646 (since
>it greatly widens the geographical area served by a single character set),
>but it still only puts the problem off for a while, and is only really
>useful for most of western Europe.  Eastern Europe, parts of the
>Mediterranean, and the Pacific Rim countries are still left high and
>dry (to name a few).  They don't have much presence in the global E-mail
>networks now, but it will only increase.

You forgot the native peoples' character sets in Scandinavia, used for
languages such as Lappish. But ISO has solutions for these problems (as long
as people cooperate)...

>Even in western Europe, character sets are still a problem.  There are an
>awful lot of people out there still using the DEC Multinational Character
>Set, which is similar to but not the same as ISO 8859/1.

And I believe that there is an ANSI 8-bit character set as well, of which
Microsoft uses a subset similar to the DEC Multinational character set.
And then there are people with HP terminals who use HP's Roman-8 character
set which is quite dissimilar to the ISO 8-bit character sets, not to mention
all IBM users out there with their PCs, ATs, PS/2s and even RTs that use
IBM's very own ASCII-superset 8-bit character set (and I don't mean EBCDIC).

>One of the biggest problems I have in writing code for MUAs and NUAs (News
>User Agents :-)) is determining what character set a given message is using.

And how to invoke that character set on the terminal currently in use, or
how to map it into one of the character sets the terminal has, or how to
display the message if the terminal doesn't support the character set
the message is written in.

>One thing I would really like to see is for MUA's to start using the
>Content-Type: field (or at least X-Content-Type:) in RFC 822 messages.
>This way the MUA can have a set of common standards it knows about, and can
>translate to whatever the user wants without lots of fancy footwork.
.....

Yes, that would be nice. Any volunteers?

>ISO 8859/1 is just the start.  Eventually, I hope the ISO finishes their
>multibyte character set standard (10646?), but who knows when that will
>happen...

In the meantime, use the 8-bit character sets defined in the ISO 8859
standard (I believe that there are 9 of them now) with the code shift
techniques described in some other ISO standard whose number I forgot.
This way, different ISO-standardized character sets can be used in the
same message to, say, write in English, mention some German names and
quote a few Lappish phrases. I believe that AT&T UNIX SVR4 will use
this technique as part of its native language support system.

Now, is there anyone who knows of a terminal that supports all the ISO
8859 character sets and the code shift sequences?  :-)

-- 
          Robert Claeson      E-mail: rclaeson@erbe.se
	  ERBE DATA AB