[comp.std.internat] Code Page Conversion

sommar@enea.se (Erland Sommarskog) (08/09/90)

(The first attempt didn't seem to leave our site. Apologies if you
are seeing this twice.)

Uwe Geuder (geuder@informatik.uni-stuttgart.de) writes:
>From Keld J|rn Simonsen:
>   I use it in email, it is built into the sendmail we use here,
>   and EUnet has decided to run this on an experimental basis
>   on all the backbones of EUnet.
>
>What does this mean? When I get mail from Sweden, it's still in Swedish
>ASCII (is that SSCII??), which is horrible to read on (US) ASCII devices
>used in Germany (German 7-bit Code is never used here). If I run conv SE US
>on such files they get much prettier. So I can't imagine that any host in
>between has already done it. Or is there no "EUnet backbone" between Sweden
>and Germany?

I get a little anxious here, but I may be misunderstanding something.
I certainly don't want mail I send out to be automatically
transformed on its way out. Yes, I understand that occurrences
of ][\}{| are not nice to read, but it seems a risky business to
translate them straight off. If I use them in a non-Swedish mail,
I usually explain them. With an unwanted transformation, that
would look a little stupid. (And how does the machine know that
I use an "[" as a dotted capital "A" and not as a left bracket?)
Wouldn't it be better if this were done at the receiver's end, on request?
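
The bracket ambiguity can be shown in a few lines. This is only a minimal
sketch, assuming the usual 7-bit Swedish assignments ([ \ ] { | } standing
for the letters Ä Ö Å ä ö å); the table and function names are hypothetical
and this is not the "conv" tool mentioned in the quoted text.

```python
# Assumed 7-bit Swedish letter positions mapped to ISO 8859-1 letters.
SE7_TO_LATIN1 = str.maketrans("[\\]{|}", "ÄÖÅäöå")

def swedish7_to_latin1(text: str) -> str:
    """Blindly reinterpret the six national positions as Swedish letters."""
    return text.translate(SE7_TO_LATIN1)

# Swedish text comes out right ...
print(swedish7_to_latin1("R{ksm|rg}s"))       # Räksmörgås
# ... but program text in the same message is damaged.
print(swedish7_to_latin1("buf[i] = '\\0';"))  # bufÄiÅ = 'Ö0';
```

The converter has no way to tell the two cases apart, which is why doing it
only on request, by someone who knows what the message contains, looks safer.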

Another question: Through a mailing-list I have indirectly received
a list of two-character codes stemming from Keld Simonsen. I don't
know whether it is this one we discuss, but I would assume so. I
must admit that I laid that one aside with the thought: "My God,
how unreadable and what an overkill!" I suspect I have missed
the point of it. Could Keld or anyone else clarify?

And a final question: we are moving into an eight-bit world. Instead
of relying on old standards, why not aim to have EUnet work with ISO
8859/1 instead? (8859 is apparently already obsolete with the recent
changes in Eastern Europe, but that is another matter.)
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se

keld@login.dkuug.dk (Keld J|rn Simonsen) (08/10/90)

sommar@enea.se (Erland Sommarskog) writes:

>Uwe Geuder (geuder@informatik.uni-stuttgart.de) writes:
>>From Keld J|rn Simonsen:
>>   I use it in email, it is built into the sendmail we use here,
>>   and EUnet has decided to run this on an experimental basis
>>   on all the backbones of EUnet.
>>
>>What does this mean? When I get mail from Sweden, it's still in Swedish
>>ASCII (is that SSCII??), which is horrible to read on (US) ASCII devices
>>used in Germany (German 7-bit Code is never used here). If I run conv SE US
>>on such files they get much prettier. So I can't imagine that any host in
>>between has already done it. Or is there no "EUnet backbone" between Sweden
>>and Germany?

>I get a little anxious here, but I may be misunderstanding something.
>I certainly don't want mail I send out to be automatically
>transformed on its way out. Yes, I understand that occurrences
>of ][\}{| are not nice to read, but it seems a risky business to
>translate them straight off. If I use them in a non-Swedish mail,
>I usually explain them. With an unwanted transformation, that
>would look a little stupid. (And how does the machine know that
>I use an "[" as a dotted capital "A" and not as a left bracket?)
>Wouldn't it be better if this were done at the receiver's end, on request?

Yes, I share Erland's concerns. You cannot simply translate 7-bit [\]
(these 7-bit values are defined as letters in both Swedish and
Danish 7-bit) to the ISO 8859-1 Swedish/Danish letters.
What we do at dkuug.dk (the Danish Internet backbone) is to transform
both 8-bit curly braces and Scandinavian letters to 7-bit [\].
In the other direction, from 7-bit Danish or Swedish to ASCII or some
8-bit code, we normally do not touch these codes.
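
As a rough illustration of the 8-bit to 7-bit direction, here is a minimal
sketch. It assumes the common 7-bit Swedish positions for the six letters;
the names are hypothetical, it is not the dkuug.dk sendmail code, and it does
not model the handling of braces mentioned above.

```python
# Assumed mapping: ISO 8859-1 Swedish letters -> 7-bit Swedish positions.
LATIN1_TO_SE7 = {"Ä": "[", "Ö": "\\", "Å": "]", "ä": "{", "ö": "|", "å": "}"}

def latin1_to_swedish7(data: bytes) -> bytes:
    """Fold ISO 8859-1 text into the 7-bit national variant."""
    text = data.decode("latin-1")
    folded = "".join(LATIN1_TO_SE7.get(ch, ch) for ch in text)
    # Any other character above 0x7F has no 7-bit position here
    # and is replaced by "?".
    return folded.encode("ascii", errors="replace")

print(latin1_to_swedish7("Ärlig".encode("latin-1")))  # b'[rlig'
```

After this step a letter and a genuine bracket occupy the same code, so the
result cannot be turned back into 8-bit text without outside knowledge.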

The conversions we do here are mostly for use on 8-bit machines,
where some run ISO 8859-1 and some run an IBM code page.

Doing it at the receiver's end: well, the receiver needs to know
what information is in there. This information must be generated
on the sender's side, since the sender knows what the message is.

>Another question: Through a mailing-list I have indirectly received
>a list of two-character codes stemming from Keld Simonsen. I don't
>know whether it is this one we discuss, but I would assume so. I
>must admit that I laid that one aside with the thought: "My God,
>how unreadable and what an overkill!" I suspect I have missed
>the point of it. Could Keld or anyone else clarify?

Yes, I have made a quite elaborate list of character names,
which is being used for mail. It is designed for worldwide use,
and the world is big. There are about 940 characters in there,
covering all 7 and 8-bit character sets I know of. It does not
yet contain any Japanese or Chinese characters.

The character names are primarily used to identify a character
and to make it possible to register its properties, such as membership
of a character set, or that it is a lower case character, in which
case the upper case character can be specified alongside.
The names do have some mnemonic value, eg a with dieresis (a-umlaut)
is called "a:". How readable and beautiful this is can always
be discussed, but there are some rules to it which are consistently
applied. The list has also been designed with short character names,
to improve compactness and reduce translation costs, and also to improve
readability and writability.
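
One way to picture such a list is as a small registry keyed by the short
name. This is only a sketch: the sole name taken from this thread is "a:";
the "A:" entry, the field names and the character set labels are assumptions
made for illustration and do not reproduce the actual list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CharEntry:
    name: str                    # short mnemonic name, e.g. "a:"
    char: str                    # the character itself
    charsets: frozenset          # character sets that contain it
    upper: Optional[str] = None  # name of the upper case counterpart, if any

# Illustrative entries only; "A:" is an assumed name for the capital letter.
REGISTRY = {
    entry.name: entry
    for entry in [
        CharEntry("a:", "ä", frozenset({"ISO 8859-1", "Swedish 7-bit"}), upper="A:"),
        CharEntry("A:", "Ä", frozenset({"ISO 8859-1", "Swedish 7-bit"})),
    ]
}

def upper_name(name: str) -> str:
    """Return the name of the upper case form, or the name itself if none."""
    return REGISTRY[name].upper or name

print(upper_name("a:"), REGISTRY[upper_name("a:")].char)  # A: Ä
```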

>And a final question: we are moving into an eight-bit world. Instead
>of relying on old standards, why not aim to have EUnet work with ISO
>8859/1 instead? (8859 is apparently already obsolete with the recent
>changes in Eastern Europe, but that is another matter.)

I am collaborating with a fellow countryman of yours, Dan Oscarsson
from LTH, on using the new ISO 10646 character set for email.
This character set has almost all characters in the world in a 32 bit
compactable code set.

No, ISO 8859 is not outdated. ISO 8859-2 covers Eastern Europe,
and ISO 8859-5 covers Russia (Cyrillic). 8859 does not cover
Japanese and other Eastern character sets, though. This was the reason
we decided on ISO 10646.

Keld Simonsen

sommar@enea.se (Erland Sommarskog) (08/12/90)

Keld J|rn Simonsen (keld@login.dkuug.dk) writes:
>No, ISO 8859 is not outdated. ISO 8859-2 covers Eastern Europe,
>and ISO 8859-5 covers Russia (Cyrillic). 8859 does not cover
>Japanese and other Eastern character sets, though. This was the reason
>we decided on ISO 10646.

What I meant when I said that 8859 was obsolete is that one year
ago it seemed like you could live with having to change to
another character set to read and write Polish, Hungarian etc,
since the political and economic situation would make such
cases rare. Now that these countries are suddenly joining the
free world, such cases can be expected to be more frequent.
And I am not only talking articles and mail in these languages,
but also multi-language texts. For instance, if Lech Walesa ever
appears on Usenet, it would be nice if his name could come out
right, with a slashed "l" and an ogonek on the "e".

(But of course, with all the ancient mailers, archaic Unix varieties,
GNU-Emacs, etc, I doubt that anything other than plain seven-bit
ASCII will ever be regarded as comme-il-faut on Usenet. :-)
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se

src@scuzzy.mbx.sub.org (Heiko Blume) (08/14/90)

sommar@enea.se (Erland Sommarskog) writes:
>(But of course, with all the ancient mailers, archaic Unix varieties,
>GNU-Emacs, etc, I doubt that anything other than plain seven-bit
>ASCII will ever be regarded as comme-il-faut on Usenet. :-)

as far as i know GNU emacs 19 will have 16bit characters.
-- 
Heiko Blume c/o Diakite   blume@scuzzy.mbx.sub.org    FAX   (+49 30) 882 50 65
Kottbusser Damm 28        blume@netmbx.UUCP           VOICE (+49 30) 691 88 93
D-1000 Berlin 61          blume@netmbx.de             TELEX 184174 intro d
scuzzy Any ACU,e 19200 6919520 ogin:--ogin: nuucp ssword: nuucp

yfcw14@castle.ed.ac.uk (K P Donnelly) (08/14/90)

I find that I can send 8-bit mail messages over the UK JANET network
to and from VAX/VMS machines without trouble.  However, it seems that if
mail goes anywhere near a Unix machine it gets the eighth bit stripped.
The trouble seems to be the file transfer utility hhcp.

I find that I can work happily in 8-bits on Edinburgh University's
central Unix machine, provided I set  stty -odd  or  stty -even.  The
new versions of microEMACS, MS-KERMIT and TeX all support 8-bit text.
However, if I try to transfer an 8-bit text file into the Unix machine
using hhcp it gets the eighth bit stripped; and if I try to transfer an
8-bit text file out from the Unix machine, hhcp treats it as a binary
file and the file gets horribly garbled.

An Irish Gaelic conferencing system which I have accessed is hosted on
a VAX/VMS machine and uses ISO 8859-1 as standard.  However, when I
access it, via IPSS and the Irish commercial packet switch network,
EIRPAC, the eighth bit gets stripped somewhere along the way.
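
To make the effect concrete, here is a minimal sketch of what stripping the
eighth bit does to ISO 8859-1 text. The function name is hypothetical and
the sample phrase is only an example; nothing here is taken from hhcp or
from any of the networks mentioned.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the top bit of every byte, as a 7-bit path would."""
    return bytes(b & 0x7F for b in data)

original = "Fáilte go hÉirinn".encode("latin-1")
print(strip_eighth_bit(original).decode("ascii"))  # Failte go hIirinn
```

Each accented letter silently turns into an unrelated ASCII character
(here 0xE1 becomes 0x61 and 0xC9 becomes 0x49), so the damage cannot be
undone at the receiving end.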

Would anyone like to summarize experiences elsewhere with 8-bit mail?
Are any networks already happily using ISO 8859-1 for mail?  What are the
main bottlenecks at present to 8-bit work?

   Kevin Donnelly

prc@erbe.se (Robert Claeson) (08/15/90)

In article <5681@castle.ed.ac.uk>, yfcw14@castle.ed.ac.uk (K P Donnelly) writes:

> I find that I can send 8-bit mail messages over the UK JANET network
> to and from VAX/VMS machines without trouble.  However, it seems that if
> mail goes anywhere near a Unix machine it gets the eighth bit stripped.
> The trouble seems to be the file transfer utility hhcp.

I assume that by "hhcp" you mean "uucp". Uucp doesn't strip anything.
One can send binary files using uucp (this is how we get our 'news').
I believe that the problem lies in "sendmail", especially as implemented
on many systems running BSD UNIX. Many System V UNIXes that come
with sendmail do allow 8-bit characters.

> Would anyone like to summarize experiences elsewhere with 8-bit mail.
> Are any networks already happily using ISO 8859-1 for mail?

Yes, our internal network is using ISO 8859/1 for e-mail. It consists
of UNIX hosts only. However, as soon as any message goes up to the EUnet
backbone here, the eighth bit is chopped off all characters in the message.

> What are the main bottlenecks at present to 8-bit work?

I don't believe that there are any. In fact, chopping off the eighth bit,
as done on many hosts, is likely to consume more CPU power. The network
bandwidth used would be the same.

The X.25 protocol as such is also eight-bit clean, but when e-mail is
transferred over X.3/X.29 (ie, the PAD function), it is common to have
to encode 8 bit data using plain ASCII or something similar, as done
by, for example the uucp 'f' protocol, btoa and uuencode. This is because
many PADs don't permit a completely eight-bit transparent path.
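
The idea behind such encodings can be sketched with the uuencode line
format, which Python's standard binascii module implements. This is a
simplified illustration with hypothetical function names; a real uuencode
file also carries "begin" and "end" lines, which are left out here.

```python
import binascii

def encode_7bit(data: bytes) -> list[bytes]:
    """Encode arbitrary bytes as uuencoded lines (45 input bytes per line)."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]

def decode_7bit(lines: list[bytes]) -> bytes:
    """Reverse encode_7bit."""
    return b"".join(binascii.a2b_uu(line) for line in lines)

payload = bytes(range(256))          # every possible 8-bit value
lines = encode_7bit(payload)
print(all(b < 0x80 for line in lines for b in line))  # True
print(decode_7bit(lines) == payload)                  # True
```

Every encoded byte fits in seven bits, so the data survives a path that
strips the eighth bit, at the cost of roughly a third more bytes on the wire.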

-- 
Robert Claeson                  |Reasonable mailers: rclaeson@erbe.se
ERBE DATA AB                    |      Dumb mailers: rclaeson%erbe.se@sunet.se
                                |  Perverse mailers: rclaeson%erbe.se@encore.com
These opinions reflect my personal views and not those of my employer (ask him).

tkld@castle.ed.ac.uk (K Davidson) (08/17/90)

In article <1740@hugo.erbe.se> prc@erbe.se (Robert Claeson) writes:
>In article <5681@castle.ed.ac.uk>, yfcw14@castle.ed.ac.uk (K P Donnelly) writes:
>
>> I find that I can send 8-bit mail messages over the UK JANET network
>> to and from VAX/VMS machines without trouble.  However, it seems that if
>> mail goes anywhere near a Unix machine it gets the eighth bit stripped.
>> The trouble seems to be the file transfer utility hhcp.
>
>I assume that by "hhcp" you mean "uucp". Uucp doesn't strip anything.
>One can send binary files using uucp (this is how we get our 'news').
>I believe that the problem lies in "sendmail", especially as implemented
>on many systems running BSD UNIX. Many System V UNIXes that come
>with sendmail do allow 8-bit characters.

  No, he really does mean hhcp. JANET does not use TCP/IP or uucp; it has
its own ``coloured book'' protocols, one of which is NIFTP (I forget the
colour) for transferring files. This is what hhcp uses.
  hhcp -b8 file remote:file might work, but most Unix utilities that deal
with rs232 lines seem to have 7bit braindamage all the way through them.
  Get hold of a uu{en,de}code that compiles on a VMS system and encode your
mail. ( Horrible isn't it :-( )

>-- 
>Robert Claeson                  |Reasonable mailers: rclaeson@erbe.se
>ERBE DATA AB                    |      Dumb mailers: rclaeson%erbe.se@sunet.se
>                                |  Perverse mailers: rclaeson%erbe.se@encore.com
>These opinions reflect my personal views and not those of my employer (ask him).

I have to add this because Pnews complains. What idiot wrote this? :-(
-- 
	.Kevin.
  <tkld@castle.ed.ac.uk> || <tkld@lfcs.ed.ac.uk> || <tkld@tardis.cs.ed.ac.uk>
                                         ...and he did not think it too many.

richard@aiai.ed.ac.uk (Richard Tobin) (08/21/90)

In article <1740@hugo.erbe.se> prc@erbe.se (Robert Claeson) writes:
>> I find that I can send 8-bit mail messages over the UK JANET network
>> to and from VAX/VMS machines without trouble.  However, it seems that if
>> mail goes anywhere near a Unix machine it gets the eighth bit stripped.
>> The trouble seems to be the file transfer utility hhcp.
>
>I assume that by "hhcp" you mean "uucp".

No, he means hhcp.  Hhcp is a program implementing NIFTP
("network-independent file transfer protocol").  Unfortunately, it
doesn't implement it well enough for 8-bit transfers to work.

The problem does not exist with the (free) unix-niftp software, and
possibly not with more recent versions of hhcp.

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin