djb@wjh12.harvard.edu (David J. Birnbaum) (11/21/89)
Note: The following is a slightly revised version of a paper presented earlier this year at the Fourth International Conference on Symbolic and Logical Computing, held at Dakota State University, Madison, South Dakota. Endnote numbers within the text are enclosed in parentheses. Readers may wish to consult a character map for ISO 8859/5 (= ECMA 113). While some computational issues mentioned here will be obvious to USENET readers, I hope the philological and linguistic perspective will prove interesting.

=================================================================

Issues in Developing International Standards for Encoding non-Latin Alphabets(1)

David J. Birnbaum
Department of Slavic Languages, University of Pittsburgh
Russian Research Center, Harvard University
djb@wjh12.harvard.edu [Internet]
djb@harvunxw.bitnet [Bitnet]

Copyright (c) 1989 by David J. Birnbaum
All rights reserved

Introduction

Defining an appropriate character set is the most important preliminary to any text processing. The generally accepted system for encoding English language texts is the American Standard Code for Information Interchange (ASCII),(2) but the development of appropriate standards for other languages and alphabets has been less successful. As a result of this lack of agreement, idiosyncratic systems have proliferated, producing predictable obstacles to the efficient exchange of data.

Recently the International Organization for Standardization (ISO) promulgated the 8859 series of standards for a variety of writing systems. One of these standards, 8859/5,(3) is designed to serve all six modern Slavic languages that use the Cyrillic alphabet (Russian, Ukrainian, Belorussian, Bulgarian, Macedonian, and Serbocroatian). My discussion today focuses on general methodological issues involved in determining appropriate international standards, which I illustrate through a specific critique of 8859/5.
The theoretical issues involved include

1) the languages to be covered;
2) the alphabets for these languages;
3) the uses to be served by the standard;
4) the coding for the individual characters.

Additional problems that influence issue 4 include

1) 7-bit or 8-bit sets or other representations;
2) compatibility with other existing standards for the same languages;
3) compatibility with character sets for other languages;
4) balancing priorities of different languages combined in a single set;
5) upper and lower case relationships;
6) sorting and string comparison order;
7) differences in information processing and information interchange requirements;
8) hardware and software limitations.

Information Processing and Information Interchange: the Difference between Local and International Standards

One important preliminary consideration is that information processing and information interchange may require different standards. Information processing is a local concern, where compatibility with existing local standards may be important. Furthermore, users may be constrained not only by local conventions, but also by specific hardware or software configurations and any peculiarities of their texts. For example, different operating systems, printers, and application software may reserve particular control characters.(4) For my own research, which requires fairly exact transcriptions of orthographically complex medieval Cyrillic manuscripts, I have to provide for a variety of non-standard characters, ligatures, and diacritics; other users will customize their systems in different ways. It is difficult for a single international Slavic Cyrillic standard to anticipate all the needs of all users, but there is no real impediment to developing a sensible standard for modern languages.
Since the optimal solution to a specific limited problem may be incompatible with a more general standard that must serve a wider range of users, the most reasonable compromise is that standards for information interchange should be designed to serve as many uses as possible as efficiently as possible, with whatever compromises that entails. Local standards for information processing, on the other hand, should be designed separately to deal effectively with specific local tasks. Filtering text files is a trivial matter and users can easily convert local formats to an accepted interchange standard if the material is to be shared with users who may have different local information processing standards.

This internationalist approach can be contrasted to the philosophy behind 8859/5, where, as we shall see, general requirements for dealing with multilingual Slavic Cyrillic texts have been needlessly subordinated to local, strictly Russian concerns.(5) It would have been more sensible to expect Russians to use their own well-established national standard locally, but to compromise on an international interchange standard that is truly international.

The Independence of Binary Representations from Keyboards, Monitors, and Printers

Humanists unfamiliar with computers often fail to realize that the internal binary representation(6) of a character set is completely independent of keyboard layouts and screen or printer displays. Striking a specific key on a keyboard generates a hardware scan code,(7) which is not the same as the binary representation of a character. The operating system is then responsible for interpreting the scan code, checking for shift or control keys and other details, and generating a binary character representation. Typing a lower case {a} on an IBM PC generates a scan code of 1Eh (30) with no shift mask, which the BIOS will translate into 61h (97).
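The lookup just described can be pictured as a table that the user is free to edit. What follows is only an illustrative sketch in Python: the names KEYMAP and translate are hypothetical and do not correspond to any real BIOS interface, and the remapping at the end (assigning the key to D0h, which is lower case Cyrillic {a} in 8859/5) is an arbitrary choice made for the example.

```python
# Hypothetical sketch of scan-code translation; not an actual BIOS interface.
# Keys are (scan_code, shift_pressed); values are binary character codes.
KEYMAP = {
    (0x1E, False): 0x61,  # the {a} key, unshifted -> 61h, as in the text
    (0x1E, True): 0x41,   # the same key, shifted -> 41h (ASCII upper case A)
}


def translate(scan_code, shift, keymap=KEYMAP):
    """Return the character code assigned to a key press, or None if unmapped."""
    return keymap.get((scan_code, shift))


# A user-defined layout: the same physical key now yields 8859/5 code D0h.
cyrillic_map = dict(KEYMAP)
cyrillic_map[(0x1E, False)] = 0xD0

print(hex(translate(0x1E, False)))                # default table
print(hex(translate(0x1E, False, cyrillic_map)))  # user-defined table
```

The scan code is the same in both calls; only the table changes, which is the sense in which the key's physical position does not determine the binary representation.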
This translation can be modified by the user, so that the physical location of a certain letter on a keyboard may determine the scan code generated, but this scan code is irrelevant to the binary representation that will be assigned to that character.

Similarly, the relationship between the internal representation of a character and its screen or printed display can be defined separately by the user to suit the application. To continue the preceding example, the binary representation 61h does not have to put a lower case {a} on the screen. The user can revise the relationship between binary representations and the character display just as he can revise the relationship between keyboard scan codes and binary representations.

Although technically more complicated than simple remapping, there are situations where it is useful to allow a single binary character representation to correspond to multiple screen or printer representations. For example, most letters of the Arabic alphabet have four separate shapes, depending on whether they appear in isolation or at the beginning, middle, or end of a word (or of a sequence of connected letters). In the dark days of typewriters, it was necessary for the typist to use multiple shift keys to enter the correct form of the character to be displayed. The most efficient scheme for encoding such contextually dependent information today is to store each Arabic letter as a separate binary code, and to make the display or printing software responsible for selecting the appropriate graphic variant.(8)

Efficient Use of 8-Bit Systems

One initially encouraging decision reflected in 8859/5 is the use of an 8-bit representation, providing 256 characters per set instead of the 128 available in a 7-bit standard.
But this sensible procedure is vitiated by the decision to retain Latin characters (with standard ASCII assignments) in the lower half of all 8859 sets, so that at most 128 positions could be available for Cyrillic characters. An additional sixty-four positions of every set are needlessly reserved for control characters, reducing the actual number of slots potentially available for Cyrillic characters to ninety-six.(9) This is barely enough to encode upper and lower case variants of all letters used in the modern Slavic languages.

Most documents are monolingual, do not combine Latin and Cyrillic, and would be better served if a larger inventory were available for the relevant alphabet. Multiple-alphabet documents could be accommodated by a standard for switching between non-overlapping sets of 256 characters. Such a standard will be necessary in any case for documents that combine, for example, Cyrillic (8859/5) with Greek (8859/7).(10) Serbocroatian is unique among the Slavic languages in its official use of both the Latin and Cyrillic alphabets, which means that documents including both versions would require 8859/5 for the Cyrillic portion and 8859/1(11) for the Latin.

Another advantage of combining character sets is that control characters could be defined for only a single set, which would open additional positions in the extended alphabet sets. Although 8859/5 is technically an 8-bit system, the combination of Latin and Cyrillic in a single set and the prodigal assignment of control characters result in the same limitations that constitute the principal liabilities of 7-bit systems.

Although all of the standard characters of the modern languages are included in ISO 8859/5, a system providing more positions could be put to wider use.
For example, there is no room in 8859/5 for European quotation marks (guillemets), which are a regular feature of Cyrillic typography.(12) Additionally, 8859/5 was designed only for *modern* Slavic Cyrillic languages and is inadequate even for basic work with historical sources that use a slightly different character set than the modern languages. The Russian alphabet was reformed in 1918 and the Bulgarian one as late as 1945; in both cases letters were deleted that would be useful to people working with earlier sources. The Ukrainian alphabet includes a separate 'g' character, the use of which has at times been considered an act of sedition by Soviet authorities and a mark of national pride by many Ukrainians, particularly in the west. Even if the matter were not politically sensitive, there are no free positions in 8859/5 for this and other obsolete letters that are important for work with historical sources.

The Internationality of International Standards

Although 8859/5 purports to be an international standard for all modern Slavic languages that use the Cyrillic alphabet, it is needlessly and offensively Russocentric. The ISO is understandably concerned with maintaining compatibility with accepted national standards, but this concern should be paramount only for monolingual standards. 8859/5 is not supposed to be a Russian standard and it should have been established by a disinterested evaluation of the requirements for dealing with six languages, rather than by slavishly adopting a Russian national system at the expense of the other languages.

Those who are familiar with Russian will note that columns 11 through 14 of 8859/5 contain the letters of the Russian alphabet in order. The thirty-third Russian character, the {e} with diaeresis, is tucked away on the side.
In almost all Russian writing, the diaeresis is omitted and this letter is treated as identical to {e}, so that it is, in some respects, a marginal part of the Russian alphabet and a good candidate for special treatment.

Case Folding in Multialphabet Sets

Reducing the 33-character Russian alphabet to 32 is desirable not only because one letter is orthographically marginal, but because 32 is a convenient number for binary computers and can facilitate case folding. Note, however, that the Russian characters begin in an odd-numbered column, while the Latin characters begin in an even-numbered one, which means that Latin and Cyrillic case folding require different algorithms.(13) If the Russian alphabet is to be reduced to 32 characters to facilitate case folding, it would seem sensible in a two-alphabet character set to establish a mapping that would allow a single procedure to accomplish case folding for both alphabets.

Of the remaining languages served by 8859/5, only the Bulgarian alphabet is a perfect subset of the Russian. Ukrainian, Belorussian, Macedonian, and Serbocroatian all include additional characters not present in Russian. In 8859/5 these have been tucked away in columns 10 and 15. This entails yet a third relationship between upper and lower case and means that case folding even for monolingual texts in languages other than Russian requires two separate procedures, one to fold 11 and 12 in with 13 and 14 and another to fold 10 and 15 together.(14)

Character Order

One advantage to following alphabetic order in character coding is that it enables alphabetic sorting by comparing strings according to machine order. This type of unfiltered sorting in 8859/5 is impossible for Ukrainian, Belorussian, Serbocroatian, or Macedonian, since the characters from columns 10 and 15 would have to be inserted into their proper places.
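A small Python sketch may make the sorting problem concrete. It relies on Python's standard iso8859_5 codec; the alphabet table UKRAINIAN_ORDER and the helper names are assumptions introduced for this example (the table omits the 'g' character discussed earlier, which 8859/5 cannot encode, and places the soft sign last, following note 15). The three sample words are arbitrary.

```python
# Sketch only: compares raw 8859/5 byte order with an explicit alphabet table.
# UKRAINIAN_ORDER is an assumption made for this example; the soft sign is
# placed last, following note 15 of the paper.
UKRAINIAN_ORDER = "абвгдеєжзиіїйклмнопрстуфхцчшщюяь"
RANK = {letter: i for i, letter in enumerate(UKRAINIAN_ORDER)}


def machine_key(word):
    """Sort key: the word's bytes in ISO 8859/5, i.e. plain machine order."""
    return word.encode("iso8859_5")


def alphabet_key(word):
    """Sort key: each letter's position in the alphabet table."""
    return [RANK[ch] for ch in word]


words = ["яблуко", "їжак", "зима"]
by_machine = sorted(words, key=machine_key)
by_alphabet = sorted(words, key=alphabet_key)
print(by_machine)   # the word beginning with the column-15 letter comes last
print(by_alphabet)  # the same word falls between the other two
```

In machine order the word beginning with the column-15 letter (F7h) lands after the word beginning with the last Russian letter (EFh), although alphabetically it belongs between the other two words.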
This is a completely unnecessary limitation, because with one minor exception(15) all modern Slavic languages that use the Cyrillic alphabet follow a single order. Not all characters will occur in each language, but a single order for the entire character set would have made it possible to sort all languages in machine order.(16)

The Problem of 8859/5 as an International Standard

The upper half of 8859/5 is an excellent example of how not to organize an international standard. It is an imperfect Russian national standard that is poorly suited to the other Slavic languages it is supposed to represent. As I mentioned earlier, standards for local information processing may differ from standards for international information interchange and a Russian writing exclusively in Russian should use the resources that best answer his requirements. A multilingual international standard, on the other hand, should balance the requirements of all the languages involved. Filtering text files to convert between local and international standards is not difficult and to favor one national system over all others as a basis for an international interchange standard is not justifiable technically, intellectually, or diplomatically.(17)

Alternative Standards

If we abandon the idea of combining Latin and Cyrillic into a single 8-bit set, it is possible to deal more effectively with Cyrillic requirements. One possible approach is that implemented in the ISO 6861 draft standard, which provides a system of extended Cyrillic sets that incorporates most of the characters required for work with modern and medieval Slavic sources and Romanian Cyrillic.(18) A standard control sequence can be used to select the appropriate set for an application, as well as to switch sets within a single text. Another desirable extension of the Cyrillic inventory would be the addition of characters from non-Slavic languages of the Soviet Union that use the Cyrillic alphabet.
Either the medieval letters of the 6861 draft standard or an extended modern Cyrillic set would provide a more efficient use of character positions than the combination of Latin and Cyrillic found in 8859/5. A single set, similar to ISO 8859/1, could serve for most of the Latin alphabet languages of Europe, while other sets could provide better support for languages using Cyrillic.

This type of approach, which overcomes the limitations inherent in any 8-bit set, which can have room for no more than 256 characters, is exemplified by the recent multi-octet (or multiple-byte) ISO draft proposal 10646. This three-dimensional representation has room for over 16 million characters, each of which could be fully specified by three bytes. Of course, a three-byte representation would be wasteful for most applications and the preliminary description of the standard includes modifications that would permit simpler representations when appropriate. These include:

1) a two-octet form, restricted exclusively to a single plane, which would suffice for most purely alphabetic applications;
2) a compacted form, permitting strings of related characters to be used as single-octets.

According to this latter modification, a string of Cyrillic characters with two of the three octets in common could be represented by a control sequence indicating that those two would be in force until further notice, whereupon the specific individual characters could be identified merely by supplying the third octet.

Conclusions

An ideal multilingual international standard would not combine completely different alphabets, such as Latin and Cyrillic, into a single character set. It should not be designed around the requirements of one language when an alternative is available that serves all the languages with equal effectiveness. If case folding is a priority, it should be implemented uniformly throughout the set.
If arranging characters in sorting order is a priority, a mapping that supports all the languages equally should be favored. Restrictions imposed by specific hardware and software configurations, as well as conformity to existing national standards, which may be of primary importance for local information processing, should not dictate international standards for information interchange. Continuity with existing national and international standards is desirable, but this desire for compatibility should not allow obsolescent decisions to retard the development of new standards that could better exploit new resources.

Notes

1) I am grateful to Steven J. DeRose for help in obtaining information about ISO standards and especially to Harry Gaylord for both help with materials and stimulating comments on many of the issues mentioned here.

2) The most frequently encountered alternative is the Extended Binary Coded Decimal Interchange Code (EBCDIC), which is used primarily on IBM mainframes. Although the alphanumeric characters of ASCII and EBCDIC correspond, small differences between EBCDIC variants (as well as variants in ASCII coding) make translation between ASCII and EBCDIC perilous and greatly complicate the transfer of files between, e.g., Internet and Bitnet sites.

3) ISO 8859/5 has been adopted by the European Computer Manufacturers Association as their Standard ECMA-113 (2nd edition, July 1988, adopted by the General Assembly of the ECMA on 30 June 1988).

4) As an example of a hardware limitation, some display adapters do not treat all characters identically. A number of MS-DOS software packages use characters between B0h and DFh (176--223) for lines and borders. In the traditional PC text display, all characters are nine pixels wide, but only the eight leftmost columns can be defined by the user.
For characters between B0h and DFh, the eighth pixel from the left is automatically duplicated in the rightmost column, while for characters outside this range, the rightmost column is blank. This enables the graphics characters in the B0h--DFh range to connect, which is convenient for continuous lines and borders. Unfortunately, this means that any user-defined alphabetic characters assigned to this range must be no more than seven pixels wide, since an 8-pixel wide character would bleed into the rightmost column.

5) 8859/5 is based on the 1987 revision of the Soviet national GOST Standard 19768.

6) I use the term "binary representation" to designate the machine coding for a character. Most standards implement 8-bit representations, although an alternative is discussed later in this paper.

7) For example, scan codes on IBM PCs essentially reflect the physical order of the keys on the keyboard. New keyboard designs have caused some keycaps to be moved, but old scan code assignments were retained for compatibility. For example, the backslash key continues to generate a 2Bh even as it moves from one location to another with each revision of the keyboard.

8) This simplifies data entry and editing, as well as sorting. An escape code would be required to display a character outside its usual context, but this extraordinarily rare situation cannot justify commandeering four binary representations for every letter of the alphabet. ISO 10646, which I discuss below, will use a single character in the text file and allow the application to transform it as appropriate for screen and printer output.

9) Proposals submitted to the ISO for 8-bit sets with a minimal number of control characters (five or fewer) have been resoundingly rejected by most of the national delegations.

Many ISO standards continue to be influenced by anachronistic concerns. Following the provisions of ISO 2022, 8-bit standards are treated as two pages of 128 characters, rather than one page of 256.
The 256 characters of 8859 and other 8-bit standards are divided into four sections: C0 (00/00--01/15), G0 (02/00--07/15), C1 (08/00--09/15), and G1 (10/00--15/15). C0 and C1 are control sections and each reserves thirty-two positions for control characters. G0 and G1 are available for two sets of graphics characters, each containing up to 96 items.

Another striking anachronism that whittles 96 items down to 95 is the designation of 07/15 as a control character. This character, traditionally called delete (DEL), was previously used to erase or obliterate erroneous or unwanted characters in punched tape. There is no justification for reserving this position today when the limited number of positions available for characters in a multi-language standard is already so restricted. Nonetheless, ISO DIS 6861 and DP 10646 reserve both 07/15 and 15/15 for control functions. ISO 8859/5, curiously, reserves only 07/15, while assigning an alphabetic character to 15/15.

Although the number and coding of control characters can be reduced with no loss of information, any such decision should be taken in conjunction with a revision of International Telecommunications Union CCITT protocols, which use control characters to regulate the transmission of digital information.

10) Equivalent to ECMA--118.

11) Equivalent to ECMA--94/1, second edition (June 1986).

ISO 2022, Information processing --- ISO 7-bit and 8-bit coded character sets --- Code extension techniques, establishes standards for switching among character sets within a document. ISO Draft Proposal 4873 (currently being revised) also deals with switching among C0, G0, C1, and G1.

12) Oddly enough, many Soviet Russian standards omit European quotation marks. There is also no provision in 8859/5 for marking accented vowels, which might be required for textbooks, dictionaries, or linguistic studies.
The 8859 standards forbid overstriking, so that any combination of character plus diacritic must have a single binary representation. 8859/5 hardly has room for accent marks, let alone fully formed accented vowel letters. The 7-bit 646 standard, now under revision, allowed for the use of a backspace combined with diacritics, which could be entered after alphabetic characters. Other standards allowed for nonspacing diacritics, similar to dead keys, which could be entered before alphabetic characters.

13) A string of Latin alphabet text can be converted to lower case by setting bit 6, which effectively adds 32 to the upper case characters while leaving the lower case unchanged. The same string can be converted to upper case by clearing bit 6. To convert a string of Russian text to lower case requires setting bit 7 and toggling bit 6. Converting Russian text to upper case is more complicated still. Note that ISO conventions call for numbering rows and columns in decimal from 00--15, rather than in hexadecimal, and for numbering bits 1--8, instead of the more common 0--7.

14) The last procedure involves setting (or clearing) bits 5 and 7.

15) The soft sign falls at the end of the alphabet in Ukrainian. This letter never occurs in initial position in any Slavic language and it is close to the end of the alphabet in the other languages, so that this peculiarity of Ukrainian will have little effect in real applications. On the other hand, the order of the Cyrillic Old Church Slavonic alphabet, used for medieval texts, differs in several places from the order in the modern languages, so that even if the Old Church Slavonic characters were added to the Cyrillic inventory, a different sorting algorithm would be required.

16) According to John Clews, Language automation worldwide: the development of character set standards, Harrogate: Sesame, 1988, Section 5.1, transliteration standards exploiting this feature have been well known for over twenty years.
17) Lamentably, the fate of ISO standards depends on the voting of national committees that may be more concerned with national prestige than with enacting efficient international standards. It is reported that an effort to establish a single character set for the Far East with no duplication met with threats by China, Japan, and Korea to withdraw if their entire national standards, duplicate characters and all, were not included. It is unlikely that sensible international standards will ever emerge from such a chauvinist atmosphere; academic projects, such as the Text Encoding Initiative, are more promising.

18) 6861 also includes a Glagolitic set, with characters assigned to the same positions as their alphabetic equivalents in Cyrillic, a layout that facilitates transliterating between Glagolitic and Cyrillic. There seems to be some uncertainty in 6861 about how to distinguish differences in character sets from differences in typefaces, but the principle of not squandering the limited inventory of binary representations available in a Cyrillic set on Latin characters is sound.
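The three case relationships described in the Case Folding section and in notes 13 and 14 can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a reference implementation: the function names are hypothetical, the code ranges are read off the 8859/5 character map, and Python's standard iso8859_5 codec is used only to move between text and bytes. Bit numbers in the comments follow the ISO convention of numbering bits 1--8.

```python
# Sketch only: lower-casing ISO 8859/5 codes needs three different rules.
def to_lower_8859_5(code):
    """Return the lower case counterpart of one 8859/5 code (0-255)."""
    if 0x41 <= code <= 0x5A:                   # Latin A-Z
        return code | 0x20                     # set ISO bit 6
    if 0xB0 <= code <= 0xCF:                   # Russian capitals, columns 11-12
        return code + 0x20                     # adds 32, but may carry into bit 7
    if 0xA1 <= code <= 0xAF and code != 0xAD:  # column 10 (ADh is soft hyphen)
        return code | 0x50                     # set ISO bits 5 and 7
    return code                                # everything else is unchanged


def fold(text):
    """Lower-case a string by way of its 8859/5 byte representation."""
    raw = text.encode("iso8859_5")
    return bytes(to_lower_8859_5(b) for b in raw).decode("iso8859_5")


print(fold("ЖУК Їжак Dog"))
```

The Latin and column-10 rules are single OR operations and can safely be applied to text that is already lower case; the rule for columns 11 and 12 is an addition that must first test the range, since applying it to a lower case letter would give the wrong code.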
tml@hemuli.atk.vtt.fi (Tor Lillqvist) (11/24/89)
In article <433@wjh12.harvard.edu> djb@wjh12.UUCP (David J. Birnbaum) writes:
> Reducing the 33-character Russian alphabet to 32 is
>desirable not only because one letter is orthographically
>marginal, but because 32 is a convenient number for binary
>computers and can facilitate case folding. Note, however, that the
>Russian characters begin in an odd-numbered column, while the
>Latin characters begin in an even-numbered one, which means that
>Latin and Cyrillic case folding require different algorithms.(13)

If we are considering future standards and trends, I think it is irrelevant that the traditional 7-bit ASCII seems to enable case folding by a simple addition/subtraction of a constant value. The same goes for ISO8859/1 and /5 (Latin 1 and Slavic (or whatever it's called)). Surely all software designed to follow local custom and typesetting rules must use more sophisticated table-driven case folding and collating algorithms. There are many obscure special cases in different languages.

One could maybe even go as far as saying that it was a Bad Thing that ASCII was designed so that the letters are in (English) alphabetic order. If they had been in random order, some standard string case folding and comparison programming language interface would have been developed earlier.

(Having said this, I must admit that I use traditional strcmp, strlwr and <ctype.h> programming practice all the time, even though the HP-UX system I use has this NLS stuff.)
--
Tor Lillqvist, VTT/ATK
sommar@enea.se (Erland Sommarskog) (11/26/89)
David J. Birnbaum (djb@wjh12.UUCP) criticized ISO 8859/5 in a long article in this newsgroup. I'm inclined to agree with him on many points. I wouldn't say I'm completely satisfied with the concept of Latin-1, Latin-2 etc. As a covering standard 6937 seems much more appealing. However, 8859 is here to stay for a while, and I think we just have to accept it as it is. After all, 8859 is a lot better than ASCII alone.

Mr. Birnbaum mainly focuses on the Cyrillic set, but many of the problems he discusses concern the Latin sets as well. I will only cover one of them here, that of collation order.

> Character Order
>
> One advantage to following alphabetic order in character
>coding is that it enables alphabetic sorting by comparing strings
>according to machine order. This type of unfiltered sorting in
>8859/5 is impossible for Ukrainian, Belorussian, Serbocroatian,
>or Macedonian, since the characters from columns 10 and 15 would
>have to be inserted into their proper places. This is a
>completely unnecessary limitation, because with one minor
>exception(15) all modern Slavic languages that use the Cyrillic
>alphabet follow a single order. Not all characters will occur in
>each language, but a single order for the entire character set
>would have made it possible to sort all languages in machine
>order.(16)

The truth is that a single enumeration doesn't apply at all for many languages. Dotted "A" and dotted "O" are separate letters in Swedish, but in German they are to be co-sorted with "A" and "O" or as "AE" and "OE". The same goes for accented letters in many languages.

The conclusion is that we need software sorting packages that can be customized as desired, with common languages predefined. Given this, it doesn't feel very important that the Cyrillic languages be granted a particular order. As some other poster said (I think it was Tor Lillqvist), the best would have been if ASCII had taken the letters in random order.
-- Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se
donn@hpfcdc.HP.COM (Donn Terry) (11/28/89)
Since I haven't seen it mentioned yet:

IEEE 1003.2 (POSIX.2), which is currently in balloting, addresses the issue of collation, case shift, case-independent comparison, etc. reasonably well. Clearly it handles issues such as collation order being distinct from codeset order, and it also handles at least some of the strangeness with German sharp-s and Spanish ch and ll.

It's being refined at the moment, and it seems quite possible that it will handle any problem short of words with the same spelling and different pronunciation sorting differently (which is NOT necessarily the worst problem). I don't yet know for sure whether it will handle what I am told Thai does: collate on first vowel.

None of the problems in Birnbaum's paper seemed at all difficult for POSIX.2 internationalization to handle.

Donn Terry
HP Ft. Collins.
(Oh yeah... and U.S. Internationalization rapporteur for SC22/WG15 among other silly titles.)