[eunet.followup] Code Page Conversion

sommar@enea.se (Erland Sommarskog) (08/09/90)

(The first attempt didn't seem to leave our site. Apologies if you
are seeing this twice.)

Uwe Geuder (geuder@informatik.uni-stuttgart.de) writes:
>From Keld J|rn Simonsen:
>   I use it in email, it is built into the sendmail we use here,
>   and EUnet has decided to run this on an experimental basis
>   on all the backbones of EUnet.
>
>What does this mean? When I get mail from Sweden, it's still in Swedish
>ASCII (is that SSCII??), which is horrible to read on (US) ASCII devices
>used in Germany (German 7-bit Code is never used here). If I run conv SE US
>on such files they get much prettier. So I can't imagine that any host in
>between has already done it. Or is there no "EUnet backbone" between Sweden
>and Germany?

I get a little anxious here, though I may be misunderstanding
something. I certainly don't want mail I send to be automatically
transformed on its way out. Yes, I understand that occurrences
of ][\}{| are not nice to read, but it seems a risky business to
translate them straight off. If I use them in a non-Swedish mail,
I usually explain them. With an unwanted transformation, that
would look a little stupid. (And how does the machine know that
I use a "[" as a dotted capital "A" and not as a left bracket?)
Wouldn't it be better if this were done at the receiver's end, on request?
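
To make the ambiguity concrete, here is a minimal Python sketch of a
conversion done at the receiver's end, on request. It covers only the
six code positions listed above, where Swedish 7-bit puts letters in
place of the ASCII brackets and braces. The names `SE7_TO_UNICODE` and
`se7_to_text` and the `swedish` flag are hypothetical, and this is not
a description of how the `conv` tool mentioned earlier works.

```python
# Illustrative subset: the six positions where Swedish 7-bit
# carries letters instead of ASCII brackets and braces.
SE7_TO_UNICODE = {
    "[": "Ä", "\\": "Ö", "]": "Å",
    "{": "ä", "|": "ö", "}": "å",
}


def se7_to_text(line: str, swedish: bool) -> str:
    """Reinterpret a 7-bit line as Swedish only if the reader asks for it."""
    if not swedish:
        return line  # leave the bytes alone: they may be real brackets
    return "".join(SE7_TO_UNICODE.get(ch, ch) for ch in line)


# Swedish text comes out right ...
print(se7_to_text("G|teborg", swedish=True))      # Göteborg
# ... but the same table ruins a line that really meant brackets.
print(se7_to_text("a[i] = b[i];", swedish=True))  # aÄiÅ = bÄiÅ;
```

Nothing in the characters themselves says which reading is intended,
so the choice has to come from the reader or from information supplied
by the sender.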

Another question: Through a mailing-list I have indirectly received
a list of two-character code stemming from Keld Simonsen. I don't
know whether it is this one we discuss, but I would assume so. I
must admit that I laid that one aside with the thought: "My God,
how unreadable and what an overkill!" I tend to think I missed
some points with its purpose. Could Keld or anyone else clarify?

And a final question: we are moving into an eight-bit world. Instead
of relying on old standards, why not aim to have EUnet work with ISO
8859/1 instead? (8859 is apparently already obsolete with the recent
changes in Eastern Europe, but that is another matter.)
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se

keld@login.dkuug.dk (Keld J|rn Simonsen) (08/10/90)

sommar@enea.se (Erland Sommarskog) writes:

>Uwe Geuder (geuder@informatik.uni-stuttgart.de) writes:
>>From Keld J|rn Simonsen:
>>   I use it in email, it is built into the sendmail we use here,
>>   and EUnet has decided to run this on an experimental basis
>>   on all the backbones of EUnet.
>>
>>What does this mean? When I get mail from Sweden, it's still in Swedish
>>ASCII (is that SSCII??), which is horrible to read on (US) ASCII devices
>>used in Germany (German 7-bit Code is never used here). If I run conv SE US
>>on such files they get much prettier. So I can't imagine that any host in
>>between has already done it. Or is there no "EUnet backbone" between Sweden
>>and Germany?

>I get a little anxious here, though I may be misunderstanding
>something. I certainly don't want mail I send to be automatically
>transformed on its way out. Yes, I understand that occurrences
>of ][\}{| are not nice to read, but it seems a risky business to
>translate them straight off. If I use them in a non-Swedish mail,
>I usually explain them. With an unwanted transformation, that
>would look a little stupid. (And how does the machine know that
>I use a "[" as a dotted capital "A" and not as a left bracket?)
>Wouldn't it be better if this were done at the receiver's end, on request?

Yes, I share Erland's concerns. You cannot just translate 7-bit [\]
(these 7-bit values are defined as letters in both Swedish and
Danish 7-bit) to ISO 8859-1 Swedish/Danish letters.
What we do at dkuug.dk (the Danish Internet backbone) is transform
both 8-bit curly braces and Scandinavian letters to 7-bit [\].
In the other direction, from 7-bit Danish or Swedish to ASCII or some
8-bit code, we normally do not touch these codes.
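
As a rough illustration of the 8-bit to 7-bit direction, the following
Python sketch folds the six Danish letters of ISO 8859-1 onto the code
positions that Danish 7-bit assigns to them. It is an assumption-laden
sketch, not the code built into our sendmail: the names `TABLE` and
`latin1_to_dk7` are made up for the example, and all other 8-bit
characters are simply passed through.

```python
# The six Danish letters in ISO 8859-1 and the 7-bit positions that
# Danish 7-bit assigns to them (same order in both strings).
LETTERS_8BIT = "ÆØÅæøå".encode("latin-1")
SLOTS_7BIT = b"[\\]{|}"
TABLE = bytes.maketrans(LETTERS_8BIT, SLOTS_7BIT)


def latin1_to_dk7(data: bytes) -> bytes:
    """Fold Danish 8-bit letters onto their 7-bit code positions."""
    return data.translate(TABLE)


print(latin1_to_dk7("Jørn".encode("latin-1")))  # b'J|rn'

# A real brace and the letter share one 7-bit code afterwards, which is
# why the reverse direction cannot be done blindly.
assert latin1_to_dk7(b"{") == latin1_to_dk7("æ".encode("latin-1"))
```

The last line shows the loss of information: once both an 8-bit brace
and an 8-bit letter have become the same 7-bit code, only extra
information from the sender can tell them apart again.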

The conversions we do here are mostly for use on 8-bit machines,
where some run ISO 8859-1 and some run an IBM code page.

Doing it at the receiver's end: well, the receiver needs to know
what information is in there. This information must be generated
on the sender's side, since the sender knows what the message is.

>Another question: Through a mailing-list I have indirectly received
>a list of two-character code stemming from Keld Simonsen. I don't
>know whether it is this one we discuss, but I would assume so. I
>must admit that I laid that one aside with the thought: "My God,
>how unreadable and what an overkill!" I tend to think I missed
>some points with its purpose. Could Keld or anyone else clarify?

Yes, I have made a quite elaborate list of character names,
which is being used for mail. It is designed for worldwide use,
and the world is big. There are about 940 characters in it,
covering all the 7-bit and 8-bit character sets I know of. It does
not yet contain any Japanese or Chinese characters.

The character names are primarily used to identify a character
and to register its properties, such as membership of a character
set, or that it is a lower case character, in which case the
corresponding upper case character can be specified alongside.
The names do have some mnemonic value, e.g. a with dieresis (a-umlaut)
is called "a:". How readable and beautiful this is can always
be discussed, but there are some rules to it which are consistently
applied. The list has also been designed with short character names
to improve compactness and reduce translation costs, and also to
improve readability and writability.
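
A minimal Python sketch of what such a registry might look like
follows. Only the name "a:" comes from the list itself; the name "A:"
for the upper case letter, the field names, and the character set
labels are assumptions made for illustration and do not reproduce the
actual list or its file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharEntry:
    name: str                     # short mnemonic, e.g. "a:"
    char: str                     # the character being named
    charsets: frozenset           # character sets that contain it
    upper: Optional[str] = None   # mnemonic of the upper case partner


# "a:" is taken from the text; "A:" is assumed by analogy.
ENTRIES = [
    CharEntry("a:", "ä", frozenset({"ISO 8859-1", "Swedish 7-bit"}), upper="A:"),
    CharEntry("A:", "Ä", frozenset({"ISO 8859-1", "Swedish 7-bit"})),
]
REGISTRY = {entry.name: entry for entry in ENTRIES}


def upper_name(name: str) -> str:
    """Return the mnemonic of the upper case partner, or the name itself."""
    return REGISTRY[name].upper or name


def in_charset(name: str, charset: str) -> bool:
    """Check whether a named character is registered in a character set."""
    return charset in REGISTRY[name].charsets


print(upper_name("a:"))                 # A:
print(in_charset("a:", "ISO 8859-1"))   # True
```

Keeping the properties keyed by name rather than by code value is what
lets one entry describe the same character across many character sets.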

>And a final question: we are moving into an eight-bit world. Instead
>of relying on old standards, why not aim to have EUnet work with ISO
>8859/1 instead? (8859 is apparently already obsolete with the recent
>changes in Eastern Europe, but that is another matter.)

I am collaborating with a fellow countryman of yours, Dan Oscarsson
from LTH, on using the new ISO 10646 character set for email.
This character set has almost all the characters in the world in a
32-bit compactable code set.

No, ISO 8859 is not outdated. ISO 8859-2 covers Eastern Europe,
and ISO 8859-5 covers Russia (Cyrillic). 8859 does not cover
Japanese and other Eastern character sets, though. This was the reason
we decided on ISO 10646.

Keld Simonsen

sommar@enea.se (Erland Sommarskog) (08/12/90)

Keld J|rn Simonsen (keld@login.dkuug.dk) writes:
>No, ISO 8859 is not outdated. ISO 8859-2 covers Eastern Europe,
>and ISO 8859-5 covers Russia (Cyrillic). 8859 does not cover
>Japanese and other Eastern character sets, though. This was the reason
>we decided on ISO 10646.

What I meant when I said that 8859 was obsolete is that one year
ago it seemed like you could live with having to change to
another character set to read and write Polish, Hungarian etc,
since the political and economic situation would make such
cases rare. Now that they are suddenly joining the
free world, these cases can be expected to be more frequent.
And I am not only talking articles and mail in these languages,
but also multi-language texts. For instance if Lech Walesa ever
appears on Usenet, it would be nice if his name could come up
right with a slashed "l" and an ogonek on the "e".
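
The point about multi-language texts can be checked with a small
Python sketch that uses Python's standard codecs for three parts of
ISO 8859. The sample sentence and the helper name `encodable` are
made up for the illustration.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


NAME = "Lech Wałęsa"
MIXED = "Wałęsa besöker Skåne"   # Polish name inside a Swedish sentence

for codec in ("iso8859_1", "iso8859_2", "iso8859_5"):
    print(codec, encodable(NAME, codec), encodable(MIXED, codec))
```

The name alone fits in ISO 8859-2 but not in ISO 8859-1 or 8859-5,
and the mixed sentence fits in none of the three, because the Swedish
"å" is missing from 8859-2 while the two Polish letters are missing
from 8859-1.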

(But of course, with all the ancient mailers, archaic Unix varieties,
GNU-Emacs, etc, I doubt that anything other than plain seven-bit
ASCII will ever be regarded as comme-il-faut on Usenet. :-)
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se