[comp.std.internat] universality of Latin-1

erik@srava.sra.co.jp (Erik M. van der Poel) (04/10/91)

I'm directing followups to comp.std.internat.

John Gilmore writes:
> And my windows all use ISO Latin 1.  If Torbj|rn would send the
> umlauted letter in that standardized character set, it would look right
> in both the States and in Sweden.

Have you ever tried to send yourself a message in Latin-1? Did it
work? And even if *you* have a reasonable version of sendmail (one
that doesn't strip the 8th bit), what makes you so certain that
Torbj|rn's message and anyone else's won't pass through a site that
*does* strip the 8th bit?

Also, what's so "standardized" about ISO Latin-1? What makes it more
standard than, say, Latin-2?
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692

randall@Virginia.EDU (Randall Atkinson) (04/11/91)

John Gilmore originally wrote:
% And my windows all use ISO Latin 1.  If Torbj|rn would send the
% umlauted letter in that standardized character set, it would look right
% in both the States and in Sweden.

In article <1110@sranha.sra.co.jp>, 
	Erik M. van der Poel <erik@srava.sra.co.jp> responded:

>Have you ever tried to send yourself a message in Latin-1? Did it
>work? And even if *you* have a reasonable version of sendmail (one
>that doesn't strip the 8th bit), what makes you so certain that
>Torbj|rn's message and anyone else's won't pass through a site that
>*does* strip the 8th bit?

It does work for a fair and ever increasing subset of the Internet.
BITNET doesn't do very well with it.  Clearly we need to move towards
8-bit and 16-bit and 32-bit transparent mail transport mechanisms.
Fortunately there are a number of possible transport mechanisms out
there to choose from, some of which are already 8-bit transparent.

>Also, what's so "standardized" about ISO Latin-1? What makes it more
>standard than, say, Latin-2?

ISO 8859/1 is NOT any "more standard" than ISO 8859/2; however, sites
in the US are in fact migrating towards ISO 8859/1 from US ASCII and
most sites in the US are NOT migrating towards ISO 8859/2 (though they
might support it on the side as vendors begin to).  The languages that
are most commonly used in the US are in ISO 8859/1 and the languages
supported by ISO 8859/2 are less commonly used (again in the US as a
whole).  

Note that ISO Latin-1 is ISO 8859/1 which is the 8-bit character set
used for Western European languages.  ISO Latin-2 is ISO 8859/2 which
is the 8-bit character set for Eastern European languages.
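
To make the difference concrete: the two parts agree on the ASCII half
and on some accented letters, but give other byte values above 0x7F
different meanings.  A minimal sketch, assuming the "iso8859-1" and
"iso8859-2" codecs bundled with Python; show_byte is a made-up helper
name, not anything from the standards.

```python
import unicodedata

def show_byte(value):
    """Return (Latin-1 reading, Latin-2 reading) of one byte value.

    show_byte is a hypothetical helper used only for this illustration.
    """
    raw = bytes([value])
    return raw.decode("iso8859-1"), raw.decode("iso8859-2")

# 0x41 is plain ASCII, 0xF6 happens to match in both parts,
# 0xE6 and 0xF8 are read differently by the two parts.
for value in (0x41, 0xF6, 0xE6, 0xF8):
    latin1, latin2 = show_byte(value)
    print(f"0x{value:02X}: {unicodedata.name(latin1)} | {unicodedata.name(latin2)}")
```

A receiver that guesses the wrong part still gets readable ASCII, but
silently shows the wrong letter for bytes such as 0xE6.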

Clearly we need to add additional information to the header of mail 
messages to indicate which character set to use.  I'm not sure of
the current state of the Internet protocols (RFC 822 et al.) with
respect to this.  If there isn't the equivalent of a "Character-set:"
header yet, serious consideration should be given to adding one with
clearly defined values for at least existing ANSI and ISO character
sets.

Character sets that should have a defined string to use with such a
header field include at least:
	ASCII
	ISO 8859/1
	  ...
	ISO 8859/N  (where N is the last defined set)
	ISO 10646   (once it gets completed)
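
As a rough illustration of how a receiver might use such a header, here
is a sketch in Python.  The header name "Character-set:" is the one
floated above; the value strings, the CODECS table, the sample message
and the decode_message helper are all assumptions made for the example
and are not part of RFC 822.

```python
# Map hypothetical header values to codec names that Python ships with.
CODECS = {
    "ASCII": "ascii",
    "ISO 8859/1": "iso8859-1",
    "ISO 8859/2": "iso8859-2",
}

def decode_message(raw):
    """Split headers from body and decode the body per Character-set.

    Simplified on purpose: no folded header lines, and an unknown
    character set name raises KeyError.
    """
    head, _, body = raw.partition(b"\n\n")
    charset = "ASCII"  # assumed default when the header is absent
    for line in head.split(b"\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"character-set":
            charset = value.strip().decode("ascii")
    return body.decode(CODECS[charset])

# Invented sample message; 0xF6 is o-with-diaeresis in ISO 8859/1.
message = b"To: someone\nCharacter-set: ISO 8859/1\n\nTorbj\xf6rn\n"
print(ascii(decode_message(message)))
```

This only works if the body bytes arrive intact, so it complements an
8-bit clean path (or an encoding layer) rather than replacing it.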

The Internet is the dominant mail transport network at present, partly
because so many other networks gateway with it.  Getting the Internet
to convert to supporting such needs would be a big step in the right
direction.  Perhaps someone on the IETF can comment on their current
activities in this area ??

Ran Atkinson
randall@Virginia.EDU

dlv@cunyvms1.gc.cuny.edu (Dimitri Vulis, CUNY GC Math) (04/12/91)

In article <1991Apr10.172756.4991@murdoch.acc.Virginia.EDU>, randall@Virginia.EDU (Randall Atkinson) writes:
>        ISO 10646   (once it gets completed)
"Unicode" seems both more practical and more realistic.
>Ran Atkinson
>randall@Virginia.EDU
Dimitri Vulis, D&M
BITNET:            DLV@CUNYVMS1
Internet:          DLV@CUNYVMS1.GC.CUNY.EDU
Snail:             Department of Mathematics/Box 330
                   City University of New York Graduate Center
                   33 West 42 Street
                   New York, NY 10036-8099
                   USA

enag@ifi.uio.no (Erik Naggum) (04/12/91)

In article <1110@sranha.sra.co.jp> erik@srava.sra.co.jp (Erik M. van der Poel) writes:

   John Gilmore writes:
   > And my windows all use ISO Latin 1.  If Torbj|rn would send the
   > umlauted letter in that standardized character set, it would look right
   > in both the States and in Sweden.

   Have you ever tried to send yourself a message in Latin-1? Did it
   work? And even if *you* have a reasonable version of sendmail (one
   that doesn't strip the 8th bit), what makes you so certain that
   Torbj|rn's message and anyone else's won't pass through a site that
   *does* strip the 8th bit?

Relax, we're working on that.  It doesn't really take an 8-bit SMTP
data path to get this done, although many think it would be kind of
useful.  Please don't confuse the transport layer word width (7 bits)
with the transported data's word width (e.g. 8 bits).
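
One way to see the distinction is to encode 8-bit data into 7-bit-safe
text before it hits the wire and decode it on arrival.  The post does
not say which encoding is being worked on; the sketch below swaps in
quoted-printable, via Python's standard quopri module, purely as an
illustration.  The sample text is invented.

```python
import quopri

# 8-bit payload: a Latin-1 string containing byte 0xF6.
payload = "Torbj\u00f6rn".encode("iso8859-1")

# Encode for a 7-bit path: every byte on the wire is below 0x80.
wire = quopri.encodestring(payload)
assert max(wire) < 0x80

# A relay that strips the 8th bit cannot damage 7-bit data.
relayed = bytes(b & 0x7F for b in wire)
assert relayed == wire

# The receiver recovers the original 8-bit payload.
received = quopri.decodestring(relayed)
print(received == payload)
```

The cost is that sender and receiver must agree on the encoding, which
again calls for some labelling in the message headers.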

   Also, what's so "standardized" about ISO Latin-1? What makes it more
   standard than, say, Latin-2?

I don't think anyone is discussing which is the "more" standardized
part of the ISO 8859 family, it's just that ISO 8859-1 has been
adopted by more people in more places than any other part has, partly
because it's better organized (IMO).

As an example, using guillemot quotes «like this», if you get + and ;,
you didn't benefit from ISO 8859-1 right now.  Maybe in the future.
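
The + and ; are exactly what a bit-stripping relay leaves behind: in
ISO 8859-1 the two angle quotes are 0xAB and 0xBB, and clearing the top
bit gives 0x2B and 0x3B.  A small Python sketch of that damage;
strip_eighth_bit is a made-up name standing in for whatever a relay
does internally.

```python
def strip_eighth_bit(data):
    """Clear the top bit of every byte, as a 7-bit relay would."""
    return bytes(b & 0x7F for b in data)

# The angle quotes are U+00AB and U+00BB, i.e. bytes 0xAB and 0xBB.
original = "\u00ablike this\u00bb".encode("iso8859-1")
damaged = strip_eighth_bit(original)

print(damaged.decode("ascii"))   # +like this;
```

The damaged text is still valid ASCII, so nothing downstream can tell
that anything was lost.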

--
I don't need ISO 8859-1 to spell my name.  Thanks, mom & dad.
--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

rja7m@calico.cs.Virginia.EDU (Ran Atkinson) (04/12/91)

UNICODE isn't a sufficient solution as it doesn't fully support (for
example) Vietnamese.  DIS 10646 is a sufficient solution.

I wish it were otherwise, but I have to live in the real world...

eliot@chutney.rtp.dg.com (Topher Eliot) (04/12/91)

In article <1991Apr10.172756.4991@murdoch.acc.Virginia.EDU>, randall@Virginia.EDU (Randall Atkinson) writes:
|> In article <1110@sranha.sra.co.jp>, 
|> 	Erik M. van der Poel <erik@srava.sra.co.jp> responded:
|> >Have you ever tried to send yourself a message in Latin-1? Did it
|> >work? And even if *you* have a reasonable version of sendmail (one
|> >that doesn't strip the 8th bit), what makes you so certain that
|> >Torbj|rn's message and anyone else's won't pass through a site that
|> >*does* strip the 8th bit?
|> It does work for a fair and ever increasing subset of the Internet.
|> BITNET doesn't do very well with it.  Clearly we need to move towards
|> 8-bit and 16-bit and 32-bit transparent mail transport mechanisms.

I expected to see someone else post a more authoritative answer, but since
none has been forthcoming, I will venture.  The folks who work on such things
have been considering the 8-bit, different-codeset issues, as part of a much
larger picture of including such things as graphics and other binary
information in mail.  Since those are harder problems, it means that they
won't have solutions all that quickly.  There is a mailing list on this
subject; if you really need it I can probably dig out a lead on how to get
onto that mailing list.

|> Fortunately there are a number of possible transport mechanisms out
|> there to choose from, some of which are already 8-bit transparent.
Ack!  "Fortunately"?  There is an ancient curse:  "may you live in interesting
times".  I think its modern equivalent is "may you have many standards to
choose from".  

-- 
Topher Eliot                           Data General DG/UX Internationalization
(919) 248-6371        62 T. W. Alexander Dr., Research Triangle Park, NC 27709
eliot@dg-rtp.dg.com                           {backbone}!mcnc!rti!dg-rtp!eliot
Obviously, I speak for myself, not for DG.

dlv@cunyvms1.gc.cuny.edu (Dimitri Vulis, CUNY GC Math) (04/14/91)

In article <ENAG.91Apr12040930@maud.ifi.uio.no>, enag@ifi.uio.no (Erik Naggum) writes:
>As an example, using guillemot quotes +like this;, if you get + and ;,
'Guillemet'.  This word was misspelled by some jerk from Adobe, and now
no one knows how to spell it right. :)

Dimitri Vulis, D&M
BITNET:            DLV@CUNYVMS1
Internet:          DLV@CUNYVMS1.GC.CUNY.EDU
Snail:             Department of Mathematics/Box 330
                   City University of New York Graduate Center
                   33 West 42 Street
                   New York, NY 10036-8099
                   USA

dlv@cunyvms1.gc.cuny.edu (Dimitri Vulis, CUNY GC Math) (04/14/91)

In article <1991Apr12.123302.17817@murdoch.acc.Virginia.EDU>, rja7m@calico.cs.Virginia.EDU (Ran Atkinson) writes:
>UNICODE isn't a sufficient solution as it doesn't fully support (for
>example) Vietnamese.  DIS 10646 is a sufficient solution.
*NOT TRUE*

Unicode supports Vietnamese.  You either don't know what you're talking about
or you're lying.

On the other hand, Cyrillic support in 10646 totally sucks, and Unicode
got it right.  Our proposed comments on what's wrong with the Cyrillic in
10646 are 12 pages long. :)

>
>I wish it were otherwise, but I have to live in the real world...

Well, it's certainly easier to get a copy of Unicode than of 10646 to see
for oneself what's in it and what's not...

Dimitri Vulis, D&M
BITNET:            DLV@CUNYVMS1
Internet:          DLV@CUNYVMS1.GC.CUNY.EDU
Snail:             Department of Mathematics/Box 330
                   City University of New York Graduate Center
                   33 West 42 Street
                   New York, NY 10036-8099
                   USA

amanda@visix.com (Amanda Walker) (04/16/91)

rja7m@calico.cs.Virginia.EDU (Ran Atkinson) writes:

   UNICODE isn't a sufficient solution as it doesn't fully support (for
   example) Vietnamese.  DIS 10646 is a sufficient solution.

Yup.  Unfortunately, I suspect that Unicode is doomed to popularity,
thanks to support from major U.S. manufacturers (IBM, Apple, etc.).
--
Amanda Walker						      amanda@visix.com
Visix Software Inc.					...!uunet!visix!amanda
-- 
I am the Imp of the Perverse (knowing this won't help you, either).

enag@ifi.uio.no (Erik Naggum) (04/18/91)

Dimitri,

I'm sure we all need your angry comments to help us reach a consensus
in this admittedly delicate matter.

Unicode is not the answer.  ISO DIS 10646 is not the answer.  As I see
it, it's too early for an answer.  However, we need a reference
standard, just like we needed a reference standard for the Latin
script, and that's ISO 6937-2.  I think ISO DIS 10646 will make a
great reference standard due to its clean design, but will not make it
into production use due to its currently lacking implementation of
that design, and it may never see widespread use except as an
interchange standard.  I don't think that's bad at all.

Let's try to fix the problem, not try to hide behind the pretense
that the _other_ party's problems are sufficiently bigger than ours.

And please remember that we're dealing with international politics and
diplomacy in ISO DIS 10646, while Unicode is pretty much free of that.
This is a problem we can't fix.

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

gwb@crosfield.co.uk (George Battrick) (04/19/91)

In article <1991Apr14.024739.3042@timessqr.gc.cuny.edu> dlv@cunyvms1.gc.cuny.edu writes:
"In article <ENAG.91Apr12040930@maud.ifi.uio.no>, enag@ifi.uio.no (Erik Naggum) writes:
">As an example, using guillemot quotes +like this;, if you get + and ;,
"'Guillemet'.  This word was misspelled by some jerk from Adobe, and now
"no one knows how to spell is right. :)
"
This is more amusing than I had realised.  The <<French>> quotation
marks, as approximated on the line above, are indeed called
"guillemets": the pronunciation is approximately "gee-uh-may" (hard
"g" as in "get").  But there *is* a word "guillemot". It's pronounced
"gilly-mott" (again a hard "g"), and it's a sea-bird of the awk
family.

[No: nothing to do with the Unix "awk"  :-) ]


-- 
George Battrick    Crosfield Electronics Ltd   Hemel Hempstead  HP2 7RH   U.K.
gwb@cel.uucp   -or-  gwb@crosfield.co.uk   -or-  ...!{mcsun,ukc,uunet}!cel!gwb
phone: +44 442 230000 ext 3638    fax: +44 442 232301   telex: 827530 CROSEL G
#include <disclaimer.std>    "Remember, George: this is no time to go wobbly!"

enag@ifi.uio.no (Erik Naggum) (04/22/91)

In article <9504@sun101.crosfield.co.uk> gwb@crosfield.co.uk (George Battrick) writes:

   In article <1991Apr14.024739.3042@timessqr.gc.cuny.edu> dlv@cunyvms1.gc.cuny.edu writes:
   >In article <ENAG.91Apr12040930@maud.ifi.uio.no>, enag@ifi.uio.no (Erik Naggum) writes:
   >>As an example, using guillemot quotes +like this;, if you get + and ;,
   >'Guillemet'.  This word was misspelled by some jerk from Adobe, and now
   >no one knows how to spell is right. :)

   This is more amusing than I had realised.  The <<French>> quotation
   marks, as approximated on the line above, are indeed called
   "guillemets": the pronunciation is approximately "gee-uh-may" (hard
   "g" as in "get").  But there *is* a word "guillemot". It's
   pronounced "gilly-mott" (again a hard "g"), and it's a sea-bird of
   the awk family.

I looked this up in my French-English dictionary, and the stupid Brit
who "translated" it managed to make it into "inverted commas".  Geez.

Guillemot is indeed a bird, also in English, according to the same
dictionary.  Then again, Norwegians call `"' "goose-eyes", so one more
bird didn't look overly strange to me.  :-)

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

daniels@parc.xerox.com (Andy Daniels) (04/25/91)

In article <1991Apr12.123302.17817@murdoch.acc.Virginia.EDU> rja7m@calico.cs.Virginia.EDU (Ran Atkinson) writes:
>UNICODE isn't a sufficient solution as it doesn't fully support (for
>example) Vietnamese.  DIS 10646 is a sufficient solution.
>

Sufficient for you, perhaps, but not for me. By your criteria, DIS
10646 doesn't support Rhade, a close neighbor of Vietnamese, nor does
it support Navajo. Moving away from Latin, where's Tamil? where's
Tibetan?

If you're looking for your favorite combination of Latin character +
applied accents as a single character in Unicode, you've missed the
point. Just about all such combinations that you find in there are
included as a result of political compromise. The "pure Unicode"
approach is that if you want, for instance, 'A' with a circumflex and
underdot, you emit exactly those three characters - it's up to your
rendering software to display the composite glyph correctly. (That's
funny, I seem to have described a Vietnamese character that Unicode
"doesn't support.")

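The base-plus-marks approach described above can be tried out with
Python's unicodedata module.  A caveat: this uses the Unicode data
shipped with a current Python, which is not the 1991 draft under
discussion, so it only illustrates the principle.  The variable names
are made up for the example.

```python
import unicodedata

# 'A' followed by two combining marks: circumflex, then dot below.
base_plus_marks = "A\u0302\u0323"
for ch in base_plus_marks:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Present-day Unicode also has a single precomposed code point for
# this letter; NFC normalization maps the sequence onto it.
composed = unicodedata.normalize("NFC", base_plus_marks)
print(len(composed), unicodedata.name(composed))
# 1 LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW

# NFD goes the other way and puts the marks in canonical order.
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Either form identifies the same letter; drawing the composite glyph is
left to the rendering software, as described above.
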
One can argue endlessly (in fact, people do) about just which set of
characters are "base" letters and which are applied marks, but the
situation in the real world is that new characters are formed in the
Latin (and to some extent Cyrillic) scripts by putting random marks on
other characters that already exist. If you try to enumerate all of
the legal possibilities, you're bound to have somebody come up to you
the day after you've sent your standard to the publisher and tell you,
"but you don't have e-diaresis-rude." You can try to include
optimizations for your favorite set, but you will then invariably
offend the people who use the first ones you've left out.

		-- Andy. --