S.Kille@CS.UCL.AC.UK (Steve Kille) (11/05/90)
The new X.400(1988) body parts refer to characters according to ISO 6937 (General Text). The X-Windows system makes extensive use of the ISO 8859 character set. Can anyone explain why different choices have been made by these groups, and of the resepctive merits of the two standards (I have seen 6937, but not 8859). Steve
eskovgaa@UVCW.UVIC.CA (Erik Skovgaard) (11/06/90)
ISO 6937 was originally in the first draft MOTIS document. CEN/CENELEC and NIST chose to include this character set together with a "default rendition method". The latter maps non-ASCII characters to ASCII codes so they can be displayed on an ASCII terminal, albeit with loss of information. ISO 6937 has several options, but basically extends IA5 by prefixing some characters with an escape code. ISO 8859 extends IA5 by using the eighth bit, and thus fewer characters are possible than in ISO 6937. In fact, we had a lot of arguments in NIST over which codes we should include as body parts. One of the arguments used was that FTAM supports ISO 8859, but as I recall, we settled on ISO 6937 since this was decided by CEN/CENELEC in their X.400 profile. ....Erik.
K.P.Donnelly@edinburgh.ac.uk (11/06/90)
ISO 6937 and ISO 8859 are both extensions of ASCII (or ISO 646) to 8 bits. Both of them avoid not only the 32 control characters of ISO 646 (columns 0 and 1) but also their 8-bit equivalents (columns 8 and 9), so as to avoid possible transmission or other difficulties.

The most important difference is that ISO 6937 has "floating diacritics" - characters of "zero width" representing accents, so that accented characters are represented by two bytes, one for the unaccented character and one for the accent. This means that it can accommodate many more accented characters within 8 bits than ISO 8859 can. In fact it copes with almost all languages using a Latin-based alphabet. However, it also means that most existing software, such as text editors, will not cope with ISO 6937, whereas most software needs little or no modification to work with ISO 8859. This is probably why ISO 6937, although it came earlier than ISO 8859, has never really been adopted, whereas ISO 8859 is becoming very widely used.

ISO 8859, because of the limited number of characters it gets into 8 bits, has to be split into several parts. ISO 8859-1 covers nearly all Western European languages, which includes a lot of languages with economic clout. ISO 8859-2 covers Eastern European languages with a Latin-based alphabet, such as Czech and Polish. ISO 8859-3 and 8859-4 mop up some of the gaps. ISO 8859-5 is for languages like Russian with a Cyrillic alphabet, 8859-6 is for Arabic, 8859-7 is for Greek and 8859-8 is for Hebrew. ISO 8859-9 is a late addition; it adds to ISO 8859-1 the characters needed for Turkish, at the expense of Icelandic, which has far fewer speakers than Turkish but which got included in ISO 8859-1 because the Icelanders got into 8-bit computing at an early stage, and also because some of its characters are used in Old English. I don't know whether ISO 6937 has any additional parts for languages such as Russian or Arabic with a non-Latin alphabet.
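[A sketch of the floating-diacritic scheme described above, in Python. The accent code points used here (0xC1 grave, 0xC2 acute, 0xC3 circumflex, 0xC8 diaeresis) follow the usual ISO 6937 layout, but treat them as illustrative rather than authoritative; the decoder itself is a minimal model, not a full 6937 implementation.]

```python
import unicodedata

# ISO 6937-style "floating diacritics": a zero-width accent byte
# precedes the base letter.  These code points follow the common
# ISO 6937 tables (illustrative, not exhaustive).
ACCENT_NAMES = {0xC1: "grave", 0xC2: "acute", 0xC3: "circumflex", 0xC8: "diaeresis"}

# Unicode combining marks let normalisation build the precomposed form.
COMBINING = {"grave": "\u0300", "acute": "\u0301",
             "circumflex": "\u0302", "diaeresis": "\u0308"}

def decode_6937(data: bytes) -> str:
    """Decode a byte string in which accents precede their base letter."""
    out, pending = [], None
    for b in data:
        if b in ACCENT_NAMES:            # zero-width accent: hold it
            pending = COMBINING[ACCENT_NAMES[b]]
        else:
            ch = chr(b)                  # ASCII base character
            if pending:
                ch = unicodedata.normalize("NFC", ch + pending)
                pending = None
            out.append(ch)
    return "".join(out)

# "caf<acute>e" takes two bytes for the accented letter, whereas
# ISO 8859-1 stores the same letter in a single byte (0xE9).
assert decode_6937(b"caf\xc2e") == "café"
assert "café".encode("latin-1")[-1] == 0xE9
```

This illustrates the trade-off in the post: the two-byte pairs let 6937 cover far more accented letters within an 8-bit code, at the cost of breaking the one-byte-per-character assumption most software makes.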
ISO 6937 is a development from Teletex. ISO 8859-1 is a development of the DEC multinational character set. Various manufacturers extended ASCII to 8 bits in various ways (e.g. the IBM-PC character set; the HP Roman 8 character set used on LaserJet II laser printers), but the DEC multinational character set has a far more logical layout of characters than the others. ISO 8859-1 is used on DEC VT320 terminals, and in terminal emulations such as MS-Kermit 3.0.

The reason that X.400(1988) refers to ISO 6937 whereas X-Windows makes use of ISO 8859 may be the association between CCITT and Teletex and the association between DEC and the development of X-Windows, or it may just be that X.400(1988) was developed earlier on.

It is now regarded as wasteful to have as many as 64 character positions reserved for control characters, and proposals have been made to extend ISO 8859-1 to cover more languages. Alternatively, it is possible that ISO 6937 might make something of a comeback within the context of structured documents. Or both ideas might be leapfrogged by two-byte or multi-byte character sets, with file compression for storage.

I am no expert and some of the above information may be wrong. If so, I would be glad of corrections. Kevin Donnelly
Stef@ICS.UCI.EDU (Einar Stefferud) (11/06/90)
Since no one has hit Steve's question on the head yet, I will take a shot at it. 8859 was designed to facilitate Data Processing, and thus it is limited to 8-bit codes only, so as to avoid the pain of data processing on mixed-length character codes. 6937 is more "transmission" oriented, with escape codes to signal semantic shifts for subsequent characters. 6937 is favored by various communication-oriented processor vendors. I believe that XEROX uses 6937 quite effectively to support many languages and character sets for their very much internationally oriented document publishing systems.

FTAM supported 8859 because of the data processing orientation of the people involved in making the implementors agreement profiles. X.400 has an obvious tilt toward documents rather than business records. Hope this helps at the meta understanding level. The conflict between 8859 and 6937 is thus deep and unresolvable, though there are some incomplete ways to map some very useful parts of 6937 onto 8859. I expect that all 8859 characters have 6937 equivalents, but this is only a guess on my part. I have no knowledge of 10646, though I expect that it is intended to somehow resolve the problems between 8859 and 6937.

The character set question is a very big mess, and getting worse as the effort to close on something common for the world takes root. There are 3 main camps. North America, where we have little problem with just using ASCII, and we wish the rest of the world would settle the question without making life too complicated for our users who only have ASCII keyboards. Europe, where there are many alphabets and lots of accents and umlauts; EWOS and others in the EU are becoming deeply involved in this mess. Asia, where there are 3 main KANJI alphabets which are very difficult to meld into some kind of single "alphabet". Japanese KANJI characters are strictly limited, and Katakana characters are used as modifiers to extend the limited set.
Chinese KANJI is not so limited, with new characters being invented over time, and with no "Katakana analog" to use for extension. I expect that Korean is more like Chinese, but I am not even slightly expert in this. Are there any other ideogram alphabets?

Anyway, the overall problems will have to be resolved among those countries that have real problems with any loss of the right to use any of their normally used characters as we move to electronic media. Although we ASCII folk may be tempted to look askance at all this character set confusion, I think that we should at least offer our sympathies to those with the real problems, while we try to keep things from getting too complicated.

I sort of shudder at being required to enter Katakana or KANJI or umlauts and accents into ORAddresses, now that X.400 allows T.61 characters in ORAddresses. I wonder how it can be done with my present systems and keyboards? I have seen how the Japanese have modified EMACS to input and display Katakana and KANJI. Rather ingenious it is, and a real testimonial to the power of EMACS. Best...\Stef
Harald.alvestrand@elab-runit.sintef.no (11/06/90)
ISO 6937 defines floating accents; that is, an A with an accent is represented as "accent-sign A", 2 octets. ISO 8859 defines a single sign "accented A". ISO 6937 also lists the "supported combinations" of accents and characters, and has a non-spacing underline, which means that you can underline anything. That in turn means that an eight-character name can take 24 bytes of storage if it is all underlined, accented characters. Makes things a bit problematic for programmers of FORTRAN. In total, ISO 6937 requires about 316 characters or character-accent combinations to be supported. That covers the needs of the Europeans that use Latin alphabets.

The question of switching character sets belongs to ISO 2022, which defines escape sequences for the purpose. That in turn refers to the "international registry of character sets", which is maintained by somebody; I THINK it is ECMA, but I do not remember this clearly.

BTW, ISO 8859 is really a collection of character sets, numbered from ISO 8859-1 (the one the US people are pushing) to ISO 8859-9 (as of now). In all the sets, the lower 128 positions are defined in the same way, but the higher positions may differ. I believe 8859-2 is suitable for the East European languages (characters like C with an inverted circumflex accent are very important in writing the languages of Czechoslovakia, for instance).

So, switching BETWEEN character sets is a requirement, at least until ISO 10646 is finalized (if ever). That one attempts to land every character in the world inside one big 32-bit character set, with ISO 8859-1 as the first 256 code positions, leading to easy compression of 8859-1 text :-)

Any clarity added? Harald Tveit Alvestrand harald.alvestrand@elab-runit.sintef.no
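[The ISO 2022 escape-sequence switching mentioned above can be seen concretely in a modern codec. This sketch uses Python's `iso2022_jp` codec, a real ISO 2022 profile: it designates JIS X 0208 with ESC $ B and returns to ASCII with ESC ( B.]

```python
# ISO 2022 in practice: the iso2022_jp codec switches character sets
# in-band with escape sequences, exactly as ISO 2022 prescribes.
text = "ISO 漢字"            # ASCII followed by two kanji

encoded = text.encode("iso2022_jp")

# ESC $ B designates JIS X 0208 (kanji); ESC ( B switches back to ASCII.
assert b"\x1b$B" in encoded
assert encoded.endswith(b"\x1b(B")

# The escape sequences round-trip cleanly back to the original text.
assert encoded.decode("iso2022_jp") == text
```

Each kanji costs two bytes plus the escape overhead around the run, which is why stateful ISO 2022 switching is workable for transmission but awkward for random-access data processing.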
philip@beeblebrox.dle.dg.com (Philip Gladstone) (11/06/90)
Standard character sets are a true minefield. The scoop is (I think) as follows:

  8859.1      8-bit. The X-Windows character set.
  8859.n      8-bit. A family of character sets, including Greek, Cyrillic, etc.
  6937.n      8-bit (with *some* two-byte characters). Overall it includes the same set of characters as the entire 8859.n family. 6937.1 & 6937.2 are (roughly) equivalent to T.61 (1984) excluding Kanji.
  T.61 (84)   8-bit (some two-byte characters), plus 16-bit Kanji. This contains all Western European characters and the Japanese Kanji set. Note that the Kanji set contains Cyrillic and *some* Greek characters (but no terminal sigma, for instance).
  T.61 (88)   8-bit (some two-byte characters), 16-bit Kanji, 8-bit Greek. This is an enhancement of the 84 version by the addition of an 8-bit Greek set. I think that Chinese also got added.
  JIS C 6226  16-bit. This is known as Kanji.
  JIS X 0208  The current name for JIS C 6226.

T61String (TeletexString) is subtly incompatible between 1984 and 1988 X.400 (but I have a defect report in about that). Note that in any event, T61String (88) is a SUPERSET of (84) in that Greek characters are allowed. Also please note that the curly bracket characters {} *are* in T61String; it is just that they are rather difficult to find, being located in the Kanji portion.

In answer to your question 'What are the respective merits?': As far as I am concerned, each character set for conveying data is sufficiently different, with different escape sequences etc., that you need a comprehensive solution to the character set problem. Once you have that, it doesn't matter much which set you use, provided you don't start losing characters. -- Philip Gladstone Development Lab Europe Data General, Cambridge England. +44 223-67600
hitoaki@kyo-sr.ntt.junet (SAKAMOTO Hitoaki) (11/07/90)
In article <PHILIP.90Nov6103250@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
> JIS C 6226 16-bit
> This is known as Kanji.
> JIS X 0208 This is the current name for JIS C 6226.

  JIS X0201-1976  Code for Information Interchange.
  JIS X0208-1990  Code of the Japanese Graphic Character Set for Information Interchange.
  JIS X0212-1990  Code of the Supplementary Japanese Graphic Character Set for Information Interchange.

"JIS X0201" is a 7-bit or 8-bit encoding of roman and "Kana" characters. "JIS X0208" and "JIS X0212" are 16-bit encodings of "Kanji" characters. ------ Hitoaki Sakamoto ( hitoaki@nttlab.ntt.jp ) Nippon Telephone and Telegraph Corporation. Tokyo Technical and Development Center.
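[The byte-cost difference between the one-byte JIS X 0201 repertoire and the two-byte JIS X 0208 repertoire can be checked through Python's `shift_jis` codec, which layers the two standards into one encoding; this is a sketch using that codec, not a claim about the 1990 documents themselves.]

```python
# JIS X 0201 vs JIS X 0208 byte costs, seen through the shift_jis codec.
halfwidth_kana = "ｱ"    # U+FF71, from the JIS X 0201 halfwidth katakana range
kanji = "漢"            # a character from the JIS X 0208 set

assert len(halfwidth_kana.encode("shift_jis")) == 1   # single byte
assert len(kanji.encode("shift_jis")) == 2            # two bytes
```

Shift_JIS distinguishes the two repertoires by byte range alone, with no ISO 2022 escape sequences, which is one reason it became the common choice on personal computers.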
hitoaki@kyo-sr.ntt.junet (SAKAMOTO Hitoaki) (11/07/90)
I am interested in X.400 with Japanese. My English is very poor, sorry. ( I would rather write in Japanese. )

> Asia, where there are 3 main KANJI alphabets which are very difficult to
> meld into some kind of single "alphabet". Japanese KANJI characters are
> strictly limited, and Katakana characters are used as modifiers to
> extend the limited set.

Why do you think that about Kanji characters?

> I sort of shudder at being required to enter Katakana or KANJI or
> umlauts and accents into ORAddresses, now that X.400 allows T.61
> characters on ORAddresses.

I think so.....

> I wonder how it can be done with my present
> systems and keyboards?

For example, Sun has JLE ( Japanese Language Environment? ). But we use the normal export version of SunOS and keyboard, and I do not shudder. (^_^) Because we are using the X-Window System.

> I have seen how the Japanese have modified EMACS to input and display
> Katakana and KANJI. Rather ingenious it is, and a real testimonial to
> the power of EMACS.

I use Japanese Emacs (Nemacs) with EGG now. It can input and display Katakana and Kanji characters. -------------------- Hitoaki Sakamoto ( hitoaki@nttlab.ntt.jp ) Nippon Telephone and Telegraph Corporation. Tokyo Technical and Development Center.
ronald@robobar.co.uk (Ronald S H Khoo) (11/08/90)
[ n.b. followups redirected for this question ] philip@beeblebrox.dle.dg.com (Philip Gladstone) writes: > Standard character sets are a true minefield. The scoop is (I think as > follows): [ useful informative table deleted ] Can anyone tell me if the line drawing characters are standardised in any ISO standard? I mean the kinds of characters you use for drawing boxes on terminals, e.g. like the ones that really annoy you when your VT100 has received some line noise :-) Email would be appreciated. If the info is forthcoming, I will summarise (to std.internat). Thank you. -- ronald@robobar.co.uk +44 81 991 1142 (O) +44 71 229 7741 (H)
prc@erbe.se (Robert Claeson) (11/08/90)
In a recent article K.P.Donnelly@edinburgh.ac.uk writes: >ISO 8859-1 is a development of the DEC multinational character set.

Actually, the way it was explained to me was that when DEC needed an 8-bit character set, they took an early draft of ISO 8859/1. Later drafts of ISO 8859/1 changed, and thus there are now about 10 characters in the right half that differ between DEC Multinational and ISO 8859/1. The DEC VT200 series has DEC Multinational as its only 8-bit character set. The VT300 and VT400 series add the true ISO 8859/1 character set. Does anyone know the *true* story behind this? -- Robert Claeson |Reasonable mailers: rclaeson@erbe.se ERBE DATA AB | Dumb mailers: rclaeson%erbe.se@sunet.se | Perverse mailers: rclaeson%erbe.se@encore.com These opinions reflect my personal views and not those of my employer.
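[A few of the differing right-half positions can be spot-checked in code. The DEC MCS assignments below are taken from commonly published DEC terminal tables and should be treated as illustrative, not an exhaustive or authoritative diff of the two sets.]

```python
# Sample right-half positions where DEC Multinational (MCS) and
# ISO 8859/1 disagree.  MCS values are from common DEC terminal
# documentation (assumed, not verified against the standard).
MCS_DIFFERS = {
    0xD7: ("Œ", "×"),   # MCS: OE ligature / Latin-1: multiplication sign
    0xDD: ("Ÿ", "Ý"),   # MCS: Y diaeresis / Latin-1: Y acute
    0xF7: ("œ", "÷"),   # MCS: oe ligature / Latin-1: division sign
}

for code, (mcs_char, latin1_char) in MCS_DIFFERS.items():
    # Latin-1 maps bytes to U+0000..U+00FF one-to-one, so this check holds.
    assert bytes([code]).decode("latin-1") == latin1_char
    assert mcs_char != latin1_char
```

A table like this is essentially what a VT200-to-Latin-1 conversion filter has to carry around, which is why text moved between the two "almost identical" sets can silently corrupt a handful of characters.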
lance@motcsd.csd.mot.com (lance.norskog) (11/08/90)
Stef@ICS.UCI.EDU (Einar Stefferud) writes: >Are there any other ideogram alphabets?

It gets worse. One of the Indian subcontinent language families, I believe Hindi and its relatives, uses modified characters. Under this system, the word "snake" becomes "snakes" by adding a squiggle to the bottom leg of the "s". Different squiggles mean "red snake" or "angry snake". So, to render a word, you have to treat it as a grammatical parse tree, with a word and possible modifiers: render the first letter of the base word, then the rest of the word, and then apply the modifiers to the first letter. This was explained to me a long time ago, and I'm sure it's bollixed, but you get the drift.