[comp.std.internat] ISO standards for non-Latin alphabets

djb@wjh12.harvard.edu (David J. Birnbaum) (11/21/89)

Note: The following is a slightly revised version of a paper pre-
sented earlier this year at the Fourth International Conference
on Symbolic and Logical Computing, held at Dakota State Univer-
sity, Madison, South Dakota. Endnote numbers within the text are
enclosed in parentheses. Readers may wish to consult a character
map for ISO 8859/5 (= ECMA 113). While some computational issues
mentioned here will be obvious to USENET readers, I hope the
philological and linguistic perspective will prove interesting.
=================================================================

Issues in Developing International Standards
for Encoding non-Latin Alphabets(1)

David J. Birnbaum
Department of Slavic Languages, University of Pittsburgh
Russian Research Center, Harvard University

djb@wjh12.harvard.edu [Internet]
djb@harvunxw.bitnet [Bitnet]

Introduction

Defining an appropriate character set is the most impor-
tant preliminary to any text processing. The generally accepted
system for encoding English language texts is the American Stan-
dard Code for Information Interchange (ASCII),(2) but the devel-
opment of appropriate standards for other languages and alphabets
has been less successful. As a result of this lack of agreement,
idiosyncratic systems have proliferated, producing predictable
obstacles to the efficient exchange of data.
Recently the International Organization for Standardization (ISO)
promulgated the 8859 series of standards for a variety of writing
systems. One of these standards, 8859/5,(3) is designed to serve
all six modern Slavic languages that use the Cyrillic alphabet
(Russian, Ukrainian, Belorussian, Bulgarian, Macedonian, and Ser-
bocroatian). My discussion today focuses on general methodologi-
cal issues involved in determining appropriate international
standards, which I illustrate through a specific critique of
8859/5.
The theoretical issues involved include

1) the languages to be covered;
2) the alphabets for these languages;
3) the uses to be served by the standard;
4) the coding for the individual characters.

Additional problems that influence issue 4 include

1) 7-bit or 8-bit sets or other representations;
2) compatibility with other existing standards for the same
languages;
3) compatibility with character sets for other languages;
4) balancing priorities of different languages combined in a
single set;
5) upper and lower case relationships;
6) sorting and string comparison order;
7) differences in information processing and information in-
terchange requirements;
8) hardware and software limitations.

Information Processing and Information Interchange:
the Difference between Local and International Standards

One important preliminary consideration is that informa-
tion processing and information interchange may require different
standards. Information processing is a local concern, where com-
patibility with existing local standards may be important. Fur-
thermore, users may be constrained not only by local conventions,
but also by specific hardware or software configurations and any
peculiarities of their texts. For example, different operating
systems, printers, and application software may reserve particu-
lar control characters.(4) For my own research, which requires
fairly exact transcriptions of orthographically complex medieval
Cyrillic manuscripts, I have to provide for a variety of non-
standard characters, ligatures, and diacritics; other users will
customize their systems in different ways. It is difficult for a
single international Slavic Cyrillic standard to anticipate all
the needs of all users, but there is no real impediment to devel-
oping a sensible standard for modern languages.
Since the optimal solution to a specific limited problem
may be incompatible with a more general standard that must serve
a wider range of users, the most reasonable compromise is that
standards for information interchange should be designed to serve
as many uses as possible as efficiently as possible, with
whatever compromises that entails. Local standards for informa-
tion processing, on the other hand, should be designed separately
to deal effectively with specific local tasks. Filtering text
files is a trivial matter and users can easily convert local
formats to an accepted interchange standard if the material is to
be shared with users who may have different local information
processing standards.
This internationalist approach can be contrasted to the
philosophy behind 8859/5, where, as we shall see, general re-
quirements for dealing with multilingual Slavic Cyrillic texts
have been needlessly subordinated to local, strictly Russian con-
cerns.(5) It would have been more sensible to expect Russians to
use their own well-established national standard locally, but to
compromise on an international interchange standard that is truly
international.

The Independence of Binary Representations
from Keyboards, Monitors, and Printers

Humanists unfamiliar with computers often fail to realize
that the internal binary representation(6) of a character set is
completely independent of keyboard layouts and screen or printer
displays. Striking a specific key on a keyboard generates a
hardware scan code,(7) which is not the same as the binary repre-
sentation of a character. The operating system is then
responsible for interpreting the scan code, checking for shift or
control keys and other details, and generating a binary character
representation. Typing a lower case {a} on an IBM PC generates a
scan code of 1Eh (30) with no shift mask, which the BIOS will
translate into 61h (97). This translation can be modified by the
user, so that the physical location of a certain letter on a key-
board may determine the scan code generated, but this scan code
is irrelevant to the binary representation that will be assigned
to that character.
Similarly, the relationship between the internal repre-
sentation of a character and its screen or printed display can be
defined separately by the user to suit the application. To con-
tinue the preceding example, the binary representation 61h does
not have to put a lower case {a} on the screen. The user can
revise the relationship between binary representations and the
character display just as he can revise the relationship between
keyboard scan codes and binary representations.
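The two independent mappings described in this section can be
sketched as a pair of lookup tables. This is only an illustration
in Python: the table and function names are hypothetical, and only
the pairing of scan code 1Eh with 61h is taken from the text.

```python
# Illustrative tables only; the names are hypothetical and do not
# correspond to any BIOS or operating system interface.

# Keyboard side: (scan code, shift held?) -> binary representation.
SCAN_TO_CODE = {
    (0x1E, False): 0x61,  # example from the text: unshifted 1Eh -> 61h
    (0x1E, True): 0x41,   # the same key with shift held
}

# Display side: binary representation -> glyph shown on the screen.
CODE_TO_GLYPH = {0x61: "a", 0x41: "A"}


def key_to_code(scan, shift=False, table=SCAN_TO_CODE):
    """Translate a keystroke into a binary character representation."""
    return table[(scan, shift)]


def code_to_glyph(code, table=CODE_TO_GLYPH):
    """Translate a binary representation into the glyph to display."""
    return table[code]


# A user-defined display table: code 61h now shows a Cyrillic letter.
CYRILLIC_DISPLAY = {**CODE_TO_GLYPH, 0x61: "\u0430"}
```

Swapping in a different table on either side changes one mapping
without touching the other, which is the independence the text
describes.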
Although technically more complicated than simple remap-
ping, there are situations where it is useful to allow a single
binary character representation to correspond to multiple screen
or printer representations. For example, most letters of the
Arabic alphabet have four separate shapes, depending on whether
they appear in isolation or at the beginning, middle, or end of a
word (or of a sequence of connected letters). In the dark days
of typewriters, it was necessary for the typist to use multiple
shift keys to enter the correct form of the character to be dis-
played. The most efficient scheme for encoding such contextually
dependent information today is to store each Arabic letter as a
separate binary code, and to make the display or printing soft-
ware responsible for selecting the appropriate graphic
variant.(8)
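A minimal sketch of such display-side selection follows. It makes
two assumptions that go beyond the text: modern Unicode code points
stand in for both the stored letters and the displayed shapes, and
only two letters are tabulated. The names are hypothetical.

```python
# Each entry: stored letter -> (isolated, final, initial, medial).
# A letter that never joins to a following letter has no initial or
# medial shape, marked here with None.
SHAPES = {
    "\u0628": ("\uFE8F", "\uFE90", "\uFE91", "\uFE92"),  # BEH
    "\u0627": ("\uFE8D", "\uFE8E", None, None),          # ALEF
}


def joins_forward(letter):
    """True if the letter can connect to the letter that follows it."""
    return letter in SHAPES and SHAPES[letter][2] is not None


def shape_word(word):
    """Pick a display shape for each stored letter from its context."""
    out = []
    for i, letter in enumerate(word):
        if letter not in SHAPES:
            out.append(letter)  # not tabulated: pass through unchanged
            continue
        isolated, final, initial, medial = SHAPES[letter]
        joined_before = i > 0 and joins_forward(word[i - 1])
        joined_after = (
            joins_forward(letter)
            and i + 1 < len(word)
            and word[i + 1] in SHAPES
        )
        if joined_before and joined_after:
            out.append(medial)
        elif joined_before:
            out.append(final)
        elif joined_after:
            out.append(initial)
        else:
            out.append(isolated)
    return "".join(out)
```

The stored text contains one code per letter; the four shapes exist
only in the output of shape_word, so sorting and editing never see
them.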

Efficient Use of 8-Bit Systems

One initially encouraging decision reflected in 8859/5 is
the use of an 8-bit representation, providing 256 characters per
set instead of the 128 available in a 7-bit standard. But this
sensible procedure is vitiated by the decision to retain Latin
characters (with standard ASCII assignments) in the lower half of
all 8859 sets, so that at most 128 positions could be available
for Cyrillic characters. Thirty-two of those 128 positions are
needlessly reserved for control characters, as are thirty-two in
the lower half, reducing the actual number of slots potentially
available for Cyrillic characters to ninety-six.(9) This is
barely enough to encode upper and lower case variants of all
letters used in the modern Slavic languages.
Most documents are monolingual, do not combine Latin and
Cyrillic, and would be better served if a larger inventory were
available for the relevant alphabet. Multiple-alphabet documents
could be accommodated by a standard for switching between non-
overlapping sets of 256 characters. Such a standard will be
necessary in any case for documents that combine, for example,
Cyrillic (8859/5) with Greek (8859/7).(10) Serbocroatian is
unique among the Slavic languages in its official use of both the
Latin and Cyrillic alphabets, which means that documents includ-
ing both versions would require 8859/5 for the Cyrillic portion
and 8859/1(11) for the Latin. Another advantage of combining
character sets is that control characters could be defined for
only a single set, which would open additional positions in the
extended alphabet sets. Although 8859/5 is technically an 8-bit
system, the combination of Latin and Cyrillic in a single set and
the prodigal assignment of control characters result in the same
limitations that constitute the principal liabilities of 7-bit
systems.
Although all of the standard characters of the modern
languages are included in ISO 8859/5, a system providing more
positions could be put to wider use. For example, there is no
room in 8859/5 for European quotation marks (guillemets), which
are a regular feature of Cyrillic typography.(12) Additionally,
8859/5 was designed only for *modern* Slavic Cyrillic languages
and is inadequate even for basic work with historical sources
that use a slightly different character set than the modern lan-
guages. The Russian alphabet was reformed in 1918 and the Bul-
garian one as late as 1945; in both cases letters were deleted
that would be useful to people working with earlier sources. The
Ukrainian alphabet includes a separate 'g' character, the use of
which has at times been considered an act of sedition by Soviet
authorities and a mark of national pride by many Ukrainians, par-
ticularly in the west. Even if the matter were not politically
sensitive, there are no free positions in 8859/5 for this and
other obsolete letters that are important for work with histori-
cal sources.

The Internationality of International Standards

Although 8859/5 purports to be an international standard
for all modern Slavic languages that use the Cyrillic alphabet,
it is needlessly and offensively Russocentric. The ISO is under-
standably concerned with maintaining compatibility with accepted
national standards, but this concern should be paramount only for
monolingual standards. 8859/5 is not supposed to be a Russian
standard and it should have been established by a disinterested
evaluation of the requirements for dealing with six languages,
rather than by slavishly adopting a Russian national system at
the expense of the other languages.
Those who are familiar with Russian will note that
columns 11 through 14 of 8859/5 contain the letters of the Rus-
sian alphabet in order. The thirty-third Russian character, the
{e} with diaeresis, is tucked away on the side. In almost all
Russian writing, the diaeresis is omitted and this letter is treated
as identical to {e}, so that it is, in some respects, a marginal
part of the Russian alphabet and a good candidate for special
treatment.

Case Folding in Multialphabet Sets

Reducing the 33-character Russian alphabet to 32 is
desirable not only because one letter is orthographically
marginal, but because 32 is a convenient number for binary com-
puters and can facilitate case folding. Note, however, that the
Russian characters begin in an odd-numbered column, while the
Latin characters begin in an even-numbered one, which means that
Latin and Cyrillic case folding require different algorithms.(13)
If the Russian alphabet is to be reduced to 32 characters to fa-
cilitate case folding, it would seem sensible in a two-alphabet
character set to establish a mapping that would allow a single
procedure to accomplish case folding for both alphabets.
Of the remaining languages served by 8859/5, only the
Bulgarian alphabet is a perfect subset of the Russian. Ukraini-
an, Belorussian, Macedonian, and Serbocroatian all include addi-
tional characters not present in Russian. In 8859/5 these have
been tucked away in columns 10 and 15. This entails yet a third
relationship between upper and lower case and means that case
folding even for monolingual texts in languages other than Rus-
sian requires two separate procedures, one to fold 11 and 12 in
with 13 and 14 and another to fold 10 and 15 together.(14)
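The three separate case relationships can be made concrete with a
short sketch. The function name and range constants are hypothetical;
the bit operations follow notes 13 and 14, and the exclusion of
positions 10/00 and 10/13 (which hold no letters in the published
8859/5 table) is stated as an assumption of this sketch.

```python
# Bits are numbered 1-8 as in the ISO convention used in the notes,
# so "bit 6" has the value 0x20, "bit 7" 0x40, and "bit 5" 0x10.
LATIN_UPPER = range(0x41, 0x5B)    # A-Z, columns 4 and 5
RUSSIAN_UPPER = range(0xB0, 0xD0)  # columns 11 and 12
EXTRA_UPPER = range(0xA1, 0xB0)    # column 10, skipping 10/00


def fold_lower(byte):
    """Fold one 8859/5 byte to lower case, one rule per region."""
    if byte in LATIN_UPPER:
        return byte | 0x20             # set bit 6
    if byte in RUSSIAN_UPPER:
        return (byte | 0x40) ^ 0x20    # set bit 7, toggle bit 6
    if byte in EXTRA_UPPER and byte != 0xAD:  # 10/13 is not a letter
        return byte | 0x50             # set bits 5 and 7
    return byte


def fold_text(data):
    """Apply fold_lower to every byte of an 8859/5 byte string."""
    return bytes(fold_lower(b) for b in data)
```

Three branches are needed where a uniform layout would need one,
and each branch must first test which region the byte belongs to.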

Character Order

One advantage to following alphabetic order in character
coding is that it enables alphabetic sorting by comparing strings
according to machine order. This type of unfiltered sorting in
8859/5 is impossible for Ukrainian, Belorussian, Serbocroatian,
or Macedonian, since the characters from columns 10 and 15 would
have to be inserted into their proper places. This is a com-
pletely unnecessary limitation, because with one minor excep-
tion(15) all modern Slavic languages that use the Cyrillic al-
phabet follow a single order. Not all characters will occur in
each language, but a single order for the entire character set
would have made it possible to sort all languages in machine or-
der.(16)
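The sorting problem is easy to reproduce. The sketch below uses
Python's iso8859_5 codec for the machine order; the word list, the
function names, and the order table are illustrative assumptions.
The table places the soft sign last, following note 15.

```python
# Ukrainian alphabetical order, lower case only, soft sign last as
# described in note 15. The table lists the separate 'g' letter even
# though 8859/5 cannot encode it.
UKRAINIAN_ORDER = "абвгґдеєжзиіїйклмнопрстуфхцчшщюяь"


def machine_order(words):
    """Sort by raw 8859/5 byte values (unfiltered machine order)."""
    return sorted(words, key=lambda w: w.encode("iso8859_5"))


def alphabetic_order(words):
    """Sort by position in the Ukrainian alphabet (table-driven)."""
    return sorted(words, key=lambda w: [UKRAINIAN_ORDER.index(c) for c in w])


WORDS = ["яблуко", "дім", "їжак", "єнот"]
```

Sorting WORDS in machine order places the word beginning with the
last letter of the Russian alphabet ahead of the two words whose
initial letters are stored in column 15, while the table-driven
sort restores the expected order.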

The Problem of 8859/5 as an International Standard

The upper half of 8859/5 is an excellent example of how
not to organize an international standard. It is an imperfect
Russian national standard that is poorly suited to the other
Slavic languages it is supposed to represent. As I mentioned
earlier, standards for local information processing may differ
from standards for international information interchange and a
Russian writing exclusively in Russian should use the resources
that best answer his requirements. A multilingual international
standard, on the other hand, should balance the requirements of
all the languages involved. Filtering text files to convert be-
tween local and international standards is not difficult and to
favor one national system over all others as a basis for an in-
ternational interchange standard is not justifiable technically,
intellectually, or diplomatically.(17)

Alternative Standards

If we abandon the idea of combining Latin and Cyrillic
into a single 8-bit set, it is possible to deal more effectively
with Cyrillic requirements. One possible approach is that imple-
mented in the ISO 6861 draft standard, which provides a system of
extended Cyrillic sets that incorporates most of the characters
required for work with modern and medieval Slavic sources and
Romanian Cyrillic.(18) A standard control sequence can be used
to select the appropriate set for an application, as well as to
switch sets within a single text.
Another desirable extension of the Cyrillic inventory
would be the addition of characters from non-Slavic languages of
the Soviet Union that use the Cyrillic alphabet. Either the
medieval letters of the 6861 draft standard or an extended modern
Cyrillic set would provide a more efficient use of character
positions than the combination of Latin and Cyrillic found in
8859/5. A single set, similar to ISO 8859/1, could serve for
most of the Latin alphabet languages of Europe, while other sets
could provide better support for languages using Cyrillic.
This type of approach overcomes the limitation inherent in any
8-bit set, which has room for no more than 256 characters. It is
exemplified by the recent multi-octet (or multiple-byte) ISO
draft proposal 10646. This three-dimensional
representation has room for over 16 million characters, each of
which could be fully specified by three bytes. Of course, a
three-byte representation would be wasteful for most applications
and the preliminary description of the standard includes modifi-
cations that would permit simpler representations when appropri-
ate. These include:

1) a two-octet form, restricted exclusively to a single
plane, which would suffice for most purely alphabetic ap-
plications;

2) a compacted form, permitting strings of related characters
to be represented as single octets.

According to this latter modification, a string of Cyrillic
characters with two of the three octets in common could be
represented by a control sequence indicating that those two would
be in force until further notice, whereupon the specific individ-
ual characters could be identified merely by supplying the third
octet.
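The idea behind the compacted form can be illustrated with a toy
model. This is not the encoding defined in the draft proposal: the
token layout, the "PREFIX" marker, and the function names are all
invented for the illustration, and characters are modeled simply as
tuples of three octets.

```python
def compact(chars):
    """Replace repeated leading octet pairs with one prefix token.

    chars is a list of (octet1, octet2, octet3) tuples. The result is
    a list in which ("PREFIX", octet1, octet2) announces the two
    octets in force, followed by bare third octets.
    """
    out = []
    prefix = None
    for first, second, third in chars:
        if (first, second) != prefix:
            prefix = (first, second)
            out.append(("PREFIX", first, second))
        out.append(third)
    return out


def expand(tokens):
    """Invert compact(): rebuild the full three-octet characters."""
    chars = []
    prefix = None
    for token in tokens:
        if isinstance(token, tuple):
            prefix = (token[1], token[2])
        elif prefix is None:
            raise ValueError("octet seen before any prefix token")
        else:
            chars.append((prefix[0], prefix[1], token))
    return chars
```

A run of characters that share their first two octets costs one
prefix token plus one octet per character, instead of three octets
per character.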

Conclusions

An ideal multilingual international standard would not
combine completely different alphabets, such as Latin and Cyril-
lic, into a single character set. It should not be designed
around the requirements of one language when an alternative is
available that serves all the languages with equal effectiveness.
If case folding is a priority, it should be implemented uniformly
throughout the set. If arranging characters in sorting order is
a priority, a mapping that supports all the languages equally
should be favored. Restrictions imposed by specific hardware and
software configurations, as well as conformity to existing na-
tional standards, which may be of primary importance for local
information processing, should not dictate international stan-
dards for information interchange. Continuity with existing na-
tional and international standards is desirable, but this desire
for compatibility should not allow obsolescent decisions to
retard the development of new standards that could better exploit
new resources.

Notes

1) I am grateful to Steven J. DeRose for help in obtaining in-
formation about ISO standards and especially to Harry Gaylord for
both help with materials and stimulating comments on many of the
issues mentioned here.

2) The most frequently encountered alternative is the Extended
Binary Coded Decimal Interchange Code (EBCDIC), which is used
primarily on IBM mainframes. Although the alphanumeric charac-
ters of ASCII and EBCDIC correspond, small differences between
EBCDIC variants (as well as variants in ASCII coding) make trans-
lation between ASCII and EBCDIC perilous and greatly complicate
the transfer of files between, e.g., Internet and Bitnet sites.

3) ISO 8859/5 has been adopted by the European Computer Manufac-
turers Association as their Standard ECMA-113 (2nd edition, July
1988, adopted by the General Assembly of the ECMA on 30 June
1988).

4) As an example of a hardware limitation, some display adapters
do not treat all characters identically. A number of MS-DOS
software packages use characters between B0h and DFh (176--223)
for lines and borders. In the traditional PC text display, all
characters are nine pixels wide, but only the eight leftmost
columns can be defined by the user. For characters between B0h
and DFh, the eighth pixel from the left is automatically dupli-
cated in the rightmost column, while for characters outside this
range, the rightmost column is blank. This enables the graphics
characters in the B0h--DFh range to connect, which is convenient
for continuous lines and borders. Unfortunately, this means that
any user-defined alphabetic characters assigned to this range
must be no more than seven pixels wide, since an 8-pixel wide
character would bleed into the rightmost column.

5) 8859/5 is based on the 1987 revision of the Soviet national
GOST Standard 19768.

6) I use the term "binary representation" to designate the ma-
chine coding for a character. Most standards implement 8-bit
representations, although an alternative is discussed later in
this paper.

7) For example, scan codes on IBM PCs essentially reflect the
physical order of the keys on the keyboard. New keyboard designs
have caused some keycaps to be moved, but old scan code assign-
ments were retained for compatibility. For example, the back-
slash key continues to generate a 2Bh even as it moves from one
location to another with each revision of the keyboard.

8) This simplifies data entry and editing, as well as sorting.
An escape code would be required to display a character outside
its usual context, but this extraordinarily rare situation cannot
justify commandeering four binary representations for every let-
ter of the alphabet. ISO 10646, which I discuss below, will use
a single character in the text file and allow the application to
transform it as appropriate for screen and printer output.

9) Proposals submitted to the ISO for 8-bit sets with a minimal
number of control characters (five or fewer) have been resoundingly
rejected by most of the national delegations.
Many ISO standards continue to be influenced by
anachronistic concerns. Following the provisions of ISO 2022, 8-
bit standards are treated as two pages of 128 characters, rather
than one page of 256. The 256 characters of 8859 and other 8-bit
standards are divided into four sections: C0 (00/00--01/15), G0
(02/00--07/15), C1 (08/00--09/15), and G1 (10/00--15/15). C0 and
C1 are control sections, each reserving thirty-two positions for
control characters. G0 and G1 are available for two sets of
graphics characters, each containing up to 96 items.
Another striking anachronism that whittles 96 items down
to 95 is the designation of 07/15 as a control character. This
character, traditionally called delete (DEL), was previously used
to erase or obliterate erroneous or unwanted characters in
punched tape. There is no justification for reserving this posi-
tion today when the limited number of positions available for
characters in a multi-language standard is already so restricted.
Nonetheless, ISO DIS 6861 and DP 10646 reserve both 07/15 and
15/15 for control functions. ISO 8859/5, curiously, reserves
only 07/15, while assigning an alphabetic character to 15/15.
Although the number and coding of control characters can
be reduced with no loss of information, any such decision should
be taken in conjunction with a revision of International Telecom-
munications Union CCITT protocols, which use control characters
to regulate the transmission of digital information.
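The column/row notation and the four-section layout described in
this note can be checked with a few lines of Python. The function
names are hypothetical; the section boundaries are the ones given
above.

```python
def position(byte):
    """Format a byte value in ISO column/row notation, e.g. 07/15."""
    return f"{byte >> 4:02d}/{byte & 0x0F:02d}"


def section(byte):
    """Name the ISO 2022 section that a byte value falls in."""
    column = byte >> 4
    if column < 2:
        return "C0"
    if column < 8:
        return "G0"
    if column < 10:
        return "C1"
    return "G1"


def count(name):
    """Count how many of the 256 positions belong to a section."""
    return sum(1 for b in range(256) if section(b) == name)
```

Counting confirms the arithmetic: thirty-two positions in each of
C0 and C1, and ninety-six in each of G0 and G1, before 07/15 is
set aside as DEL.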

10) Equivalent to ECMA--118.

11) Equivalent to ECMA--94/1, second edition (June 1986). ISO
2022, Information processing --- ISO 7-bit and 8-bit coded
character sets --- Code extension techniques, establishes stan-
dards for switching among character sets within a document. ISO
Draft Proposal 4873 (currently being revised) also deals with
switching among C0, G0, C1, and G1.

12) Oddly enough, many Soviet Russian standards omit European
quotation marks. There is also no provision in 8859/5 for mark-
ing accented vowels, which might be required for textbooks, dic-
tionaries, or linguistic studies. The 8859 standards forbid
overstriking, so that any combination of character plus diacritic
must have a single binary representation. 8859/5 hardly has room
for accent marks, let alone fully formed accented vowel letters.
The 7-bit 646 standard, now under revision, allowed for
the use of a backspace combined with diacritics, which could be
entered after alphabetic characters. Other standards allowed for
nonspacing diacritics, similar to dead keys, which could be en-
tered before alphabetic characters.

13) A string of Latin alphabet text can be converted to lower
case by setting bit 6, which effectively adds 32 to the upper
case characters while leaving the lower case unchanged. The same
string can be converted to upper case by clearing bit 6. To con-
vert a string of Russian text to lower case requires setting bit
7 and toggling bit 6. Converting Russian text to upper case is
more complicated still. Note that ISO conventions call for num-
bering rows and columns in decimal from 00--15, rather than in
hexadecimal, and for numbering bits 1--8, instead of the more
common 0--7.

14) The last procedure involves setting (or clearing) bits 5 and
7.

15) The soft sign falls at the end of the alphabet in Ukrainian.
This letter never occurs in initial position in any Slavic lan-
guage and it is close to the end of the alphabet in the other
languages, so that this peculiarity of Ukrainian will have little
effect in real applications. On the other hand, the order of the
Cyrillic Old Church Slavonic alphabet, used for medieval texts,
differs in several places from the order in the modern languages,
so that even if the Old Church Slavonic characters were added to
the Cyrillic inventory, a different sorting algorithm would be
required.

16) According to John Clews, Language automation worldwide: the
development of character set standards, Harrogate: Sesame, 1988,
Section 5.1, transliteration standards exploiting this feature
have been well known for over twenty years.

17) Lamentably, the fate of ISO standards depends on the voting
of national committees that may be more concerned with national
prestige than with enacting efficient international standards.
It is reported that an effort to establish a single character set
for the Far East with no duplication met with threats by China,
Japan, and Korea to withdraw if their entire national standards,
duplicate characters and all, were not included. It is unlikely
that sensible international standards will ever emerge from such
a chauvinist atmosphere; academic projects, such as the Text En-
coding Initiative, are more promising.

18) 6861 also includes a Glagolitic set, with characters as-
signed to the same positions as their alphabetic equivalents in
Cyrillic, a layout that facilitates transliterating between
Glagolitic and Cyrillic. There seems to be some uncertainty in
6861 about how to distinguish differences in character sets from
differences in typefaces, but the principle of not squandering
the limited inventory of binary representations available in a
Cyrillic set on Latin characters is sound.

tml@hemuli.atk.vtt.fi (Tor Lillqvist) (11/24/89)

In article <433@wjh12.harvard.edu> djb@wjh12.UUCP (David J. Birnbaum) writes:

>        Reducing the 33-character Russian alphabet to 32 is
>desirable not only because one letter is orthographically
>marginal, but because 32 is a convenient number for binary com-
>puters and can facilitate case folding.  Note, however, that the
>Russian characters begin in an odd-numbered column, while the
>Latin characters begin in an even-numbered one, which means that
>Latin and Cyrillic case folding require different algorithms.(13)

If we are considering future standards and trends, I think it is
irrelevant that the traditional 7-bit ASCII seems to enable case
folding by a simple addition/subtraction of a constant value.  The
same goes for ISO8859/1 and /5 (Latin 1 and Slavic (or whatever it's
called)).  Surely all software designed to follow local custom and
typesetting rules must use more sophisticated table-driven case folding
and collating algorithms.  There are many obscure special cases in
different languages.
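A minimal sketch of table-driven case folding, in Python. The
function names are hypothetical, and the table is derived from
Python's own codec and case rules rather than from any standard's
text, which is an assumption of this sketch.

```python
def build_lower_table(codec="iso8859_5"):
    """Build a 256-entry byte-to-byte lower-casing table for a codec."""
    table = bytearray(range(256))  # start from the identity mapping
    for b in range(256):
        try:
            lowered = bytes([b]).decode(codec).lower().encode(codec)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # undefined position, or no lower case in this set
        if len(lowered) == 1:
            table[b] = lowered[0]
    return bytes(table)


LOWER_8859_5 = build_lower_table()


def fold(data, table=LOWER_8859_5):
    """Fold a byte string to lower case with a single table lookup."""
    return data.translate(table)
```

With a table, the layout of the code set no longer matters: the
same fold function serves any 8-bit set once its table is built,
for example build_lower_table("latin_1").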

One could maybe even go as far as saying that it was a Bad Thing that
ASCII was designed so that the letters are in (English) alphabetic
order.  If they had been in random order, some standard string case
folding and comparison programming language interface would have been
developed earlier.

(Having said this, I must admit that I use traditional strcmp, strlwr
and <ctype.h> programming practice all the time, even though the HP-UX
system I use has this NLS stuff.)
-- 
Tor Lillqvist, VTT/ATK

sommar@enea.se (Erland Sommarskog) (11/26/89)

David J. Birnbaum (djb@wjh12.UUCP) criticized ISO 8859/5 in
a long article in this newsgroup. I'm inclined to agree
with him on many points. I wouldn't say I am completely satisfied
with the concept of Latin-1, Latin-2 etc. As a covering standard
6937 seems much more appealing. However, 8859 is here to stay
for a while, and I think we just have to accept it as it is.
After all 8859 is a lot better than ASCII alone.

Mr. Birnbaum mainly focuses on the Cyrillic set, but many of the
problems he discusses concern the Latin sets as well. I will
only cover one of them here, that of collation order.

>                         Character Order
>
>        One advantage to following alphabetic order in character
>coding is that it enables alphabetic sorting by comparing strings
>according to machine order.  This type of unfiltered sorting in
>8859/5 is impossible for Ukrainian, Belorussian, Serbocroatian,
>or Macedonian, since the characters from columns 10 and 15 would
>have to be inserted into their proper places.  This is a com-
>pletely unnecessary limitation, because with one minor excep-
>tion(15) all modern Slavic languages that use the Cyrillic al-
>phabet follow a single order.  Not all characters will occur in
>each language, but a single order for the entire character set
>would have made it possible to sort all languages in machine or-
>der.(16)

The truth is that a single enumeration doesn't apply at all for many 
languages. Dotted "A" and dotted "O" are separate letters in
Swedish, but in German they are to be co-sorted with "A" and "O"
or as "AE" and "OE". Same goes for accented letters in many languages.
The conclusion is that software sorting packages are needed that can be
customized to the desired language, with common languages predefined.
  Given this, it doesn't seem very important that the Cyrillic
languages be honored with a particular order.
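A small sketch of such customizable sorting, in Python. The key
functions, their names, and the word list are illustrative
assumptions; the German rule shown is the one that co-sorts the
dotted letters with "A" and "O", not the "AE"/"OE" variant.

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(word):
    """Dotted letters are separate letters, placed after z."""
    return [SWEDISH_ALPHABET.index(c) for c in word.lower()]


def german_key(word):
    """Dotted letters sort with their base letters; ties keep order
    by the original spelling."""
    return (word.lower().translate(GERMAN_FOLD), word)


WORDS = ["Zebra", "Äpfel", "Apfel", "Ost", "Öl"]
swedish = sorted(WORDS, key=swedish_key)
german = sorted(WORDS, key=german_key)
```

The same five words come out in two different orders, and neither
order can be obtained from code positions alone.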

As another poster (I think it was Tor Lillqvist) said, it would have
been best if ASCII had put the letters in random order.
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se

donn@hpfcdc.HP.COM (Donn Terry) (11/28/89)

Since I haven't seen it mentioned yet:  IEEE 1003.2 (POSIX.2) which is
currently in balloting addresses the issue of collation, case shift,
case-independent comparison, etc.  reasonably well.  Clearly it handles
issues such as collation order being distinct from codeset order, and it
also handles at least some of the strangeness with German sharp-s and
Spanish ch and ll.  It's being refined at the moment, and it seems quite
possible that it will handle any problem short of making words with the
same spelling but different pronunciations sort differently (which is
NOT necessarily the worst problem).  I don't yet know for sure whether
it will handle what I am told Thai does: collate on first vowel.

None of the problems in Birnbaum's paper seemed at all difficult for
POSIX.2 internationalization to handle.

Donn Terry
HP Ft. Collins.
(Oh yeah... and U.S. Internationalization rapporteur for SC22/WG15
among other silly titles.)