DLV@CUNYVMS1.BITNET (07/12/90)
Ed Hart said:
>I could not locate the inverted "U" accent in the two rows (382) characters
>of Cyrillic in ISO/dp 10646 draft 2.

I have been promising various people to post my comments on CD 10646 for
some weeks, and never got around to it. I'll do so now. Since I use
computers quite a bit to process Russian texts, and sometimes give others
free advice (for what it's worth), I feel that these comments may be of
interest to some. I'm cross-posting this to several lists; please accept my
apologies for the inevitable duplication, especially if you are not
interested in Cyrillic text processing.

I am very grateful to Johan van Wingen for telling me about the document
JTC1/SC2 N 2112, Summary of voting on DP 10646.2, Multiple-octet code, for
helping me obtain attachment 12 (the official comment of the USSR national
body on the required Cyrillic character set), and for all the invaluable
help he has provided in the past to me personally, the Russian TeX project,
and to the users' community at large. I would also like to thank David
Birnbaum and Wayles Browne for telling me things about processing Cyrillic
texts in languages other than modern Russian (I readily admit that I know
practically nothing firsthand about non-Russian Cyrillic text processing).

General remarks on 10646:

I am unhappy about the absence of floating accents and the lack of
provision for registering italic / cursive variants of glyphs. I know that
handling floating / non-spacing accents is a pain in the neck; it really is
more a matter of religious beliefs. :) I believe in floating accents.

---

Obvious technical remarks about 2112 attachment 12:

032/032/039/112,113 is in the text, but not in the picture;
032/032/039/188,189 and 032/032/039/190,191 are in the picture, but not in
the text; there is a little mess-up near 213.

Serbocroatian DJE and (Abkhazian et al) 'TE with scythe' are not the same
letter. They look very different!
032/032/039/216,217 should be 'TE with scythe'; 032/032/039/218,219 should
be not 'DJE with acute' but 'TE with scythe and acute'.

The Khakassian letter 032/032/039/208,209 is not 'CHE with ogonek' but 'CHE
with cedilla on the left'. (Cyrillic cedilla, like the one on the left of
Cyrillic DE.)

---

Letter names do not conform to the ISO rules (I've been told by JvW, and it
is also my impression:). I've sent the list of names of letters in 2112
that we've come up with for the TeX project to JvW and Vinogradov; I'll be
very pleased if someone finds that paper useful. I am not sure about proper
English names for the following letters and letter elements:

  'Cyrillic cedilla' (the descender on DE, TSE, and SHCHA).
  'Scythe'
  'Rounded circumflex'
  'Turkic Y-like YU'

The other heathen shapes seem to have been christened well.

---

Accented letters:

(Skip the following introduction if you already know it.)

In regular Russian text processing accented letters are seldom used. An
acute accent over a vowel indicates that this vowel is stressed. These
accents are sometimes used in books intended for children or students of
the language; it is presumed that those who know the language will know
where the stress lies. HOWEVER, there are a few cases where it is necessary
to indicate the stress to avoid ambiguity. E.g., b\'{o}l'shaya chast' means
"larger part", while bol'sh\'{a}ya chast' simply means "large part", and
just bol'shaya chast' is ambiguous; thus 8859-5 is really insufficient to
handle such cases. (There are many other examples, e.g. stressed cht\'{o}
vs. unstressed chto in sentences.)

The diaeresis over e is used to indicate that the letter is pronounced
'yo', and indeed 8859-5 has 'yo' as a separate letter.
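
To make the bol'shaya chast' example concrete, here is a minimal sketch of
the floating-accent approach in Python. It is only an illustration under
stated assumptions: it uses present-day Unicode combining characters (U+0301
for the acute) and Python's built-in iso8859_5 codec, neither of which comes
from CD 10646 or 2112, and the helper names stress, strip_accents and
fits_8859_5 are made up for this example.

```python
import unicodedata

# Assumption: the floating (non-spacing) acute is modelled by Unicode U+0301.
COMBINING_ACUTE = "\u0301"


def stress(word: str, vowel_index: int) -> str:
    """Return `word` with a floating acute placed after the vowel at `vowel_index`."""
    return word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1:]


def strip_accents(text: str) -> str:
    """Drop floating acutes, giving the plain spelling used in regular texts."""
    return text.replace(COMBINING_ACUTE, "")


def fits_8859_5(text: str) -> bool:
    """True if every character of `text` can be encoded in ISO 8859-5."""
    try:
        text.encode("iso8859_5")
    except UnicodeEncodeError:
        return False
    return True


plain = "большая"           # bol'shaya, ambiguous without a stress mark
larger = stress(plain, 1)   # acute on the 'o': "larger"
large = stress(plain, 5)    # acute on the second 'a': "large"

# The two readings differ only by where the floating accent sits.
print(larger != large)                                # True
print(strip_accents(larger) == strip_accents(large))  # True
# The plain word fits in 8859-5; the stressed forms do not.
print(fits_8859_5(plain), fits_8859_5(larger))        # True False
print(unicodedata.name(COMBINING_ACUTE))              # COMBINING ACUTE ACCENT
```

The point of the sketch is that one floating acute serves every vowel, so
no pre-accented letter has to be registered for each vowel+accent pair; the
price is that software must cope with a non-spacing character.
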
One almost never needs to combine the diaeresis with an acute, because
'yo' is almost always stressed; the only obvious exceptions are words like
"tryokhkolyosnyi" (the primary stress is on the second 'yo' and you need 'e
with diaeresis and acute' to express this) and Hungarian geographical names
transliterated in Cyrillic. In regular texts (i.e., not those intended for
students of the language) just 'e' is used, since the reader is expected to
know when an 'e' becomes 'yo'. However, WB tells me that in Belorussian
texts the use of \"{e} is mandatory. Finally, in student texts the grave
accent is sometimes used to indicate a secondary stress in multi-syllable
words (e.g., on the first \"{e} in the example above).

2112 has most, but not all, vowel+acute combinations needed for Russian and
Ukrainian text processing. It lacks 'reverse e' with acute, (Ukrainian)
Latin i with acute and (Ukrainian) Latin i with both diaeresis and acute.
'Reverse e' with acute is very important, because this letter is often used
for transliterating foreign words and names, and indicating the stress
there may be crucial. With the addition of 'reverse e with acute' this
inventory would be sufficient for modern Russian. For processing 'student
texts' it would be good to provide complements with graves for all the
letters present with acutes (+ the 3 above).

Birnbaum and Browne have told me that in Bulgarian the grave accent is
traditionally used to indicate the primary accent, and there are situations
similar to Russian, where the indication of stress is required to avoid
ambiguity. In particular, 'i' is 'and' and 'i with grave' is the feminine
dative singular pronoun. Thus, the addition of vowels with grave seems to
be a requirement for Bulgarian, not just a 'nice to have' addition for
Russian.

Browne has graciously sent me a paper that mentions, among other things,
the vowel-accent combinations needed for Serbo-Croatian. He said:

>Serbo-Croatian is normally written without accent marks.
>But some grammar books and dictionaries use the marks, and one occasionally
>uses a mark in ordinary texts. Since the language distinguishes long vs.
>short vowels, and also rising vs. falling pitch contour on accented vowels,
>the number of different marks required is large.

('vowels' = Cyrillic a e i o r u (yes, r is a 'vowel' in SC)).

  Short falling:           'vowels' with double grave accent
  Short rising:            'vowels' with single grave accent
  Long falling:            'vowels' with 'rounded circumflex'
  Long rising:             'vowels' with acute accent
  Long, but not accented:  'vowels' with macron

My understanding is that these diacritical marks are required for 'student
texts' intended to show students of the language how to pronounce
Serbo-Croatian words, but not required for other kinds of text processing.
For TeX, we will most likely support them all simply by providing the
appropriate floating accents (double grave and rounded circumflex).

---

Character inventory for the Cyrillic-based writing systems of the assorted
Soviet minorities other than Belorussians and Ukrainians (i.e., Abkhazian,
Azeri, ... Yakut):

For the RusTeX project at some point we collected the list of the letters
needed for processing text in these languages (that is, additional letters
not in ISO8859-5 and equivalents, used to denote special sounds in those
languages; e.g., {\cyr G} with horizontal stroke for representing Turkic
'gh' sound, the Y-like letter representing the \"{u} sound in several
Turkic languages, etc). We used Gilyarevsky and Grivnin's book (recommended
by JvW) and a few other useful references. In the DP 10646.2 that I
received, the Cyrillic part had many strange and weird glyphs which I could
not identify after consulting G&G and other sources. (I'm curious how this
list of glyphs was composed in the first place!) I was pleased to see that
the Soviet part of 2112 was very similar to what we had come up with (this
may simply indicate that we used the same sources for our research:).
There are, however, 3 letters of this class present in our proposed TeX
typeface inventory and not present in 2112:

  1 Ukrainian hard G
  2 Uzbek KHA with ogonek (Q)
  3 Uzbek KA with ogonek (H)

(Note: it is permissible to replace KA with ogonek and KHA with ogonek by
their versions with 'Cyrillic cedilla', and indeed this substitution can be
found in many Uzbek texts printed outside of Uzbekistan.)

In addition, 'KHA with breve' and 'Turkic Y-like yu with diaeresis' may be
required.

---

Early (Obsolete) letters

For my work I needed the 3 letters that were dropped from the Russian
alphabet in 1918: yat', fita, and izhitsa. (The 4th character of this
group, 'latin i', is present because it's used in Ukrainian.) I should
explain that I am an aspiring cryptologist and I have become concerned with
certain aspects of lexicography because of that. I once wrote a little PD
hack for adding an ISO 8859-5 code page to MS-DOS computers, which is, it
seems, widely used; in that hack, I added the letters yat', fita, and
izhitsa, as well as the guillemets (French quotes), in arbitrary positions
in columns 8 and 9, making my files somewhat non-standard. :)

The second draft of DP 10646 said:

>C.10 Bibliographic characters
>
>Bibliographic applications, and those of librarians and linguists require
>obsolete letters (especially for Cyrillic with pre-reformed orthography).

Indeed, it included the yat' character, but not the other two (much less
common than yat'). I was really disappointed to see that yat' has been
deleted in 2112, and the other 2 letters were not added.

There are other, more obscure, obsolete Cyrillic letters required for
bibliographic and linguistic work. For example, the 'nasal' letters
(Cyrillic equivalents of Polish a and e with ogonek) were used in Bulgaria
until 1945. The manual: V.M.Andryushchenko, 'Kontseptsiya i arkhitektura
mashinnogo fonda russkogo yazyka', Moscow, Nauka, 1989 has on p.
187 an 'expanded Cyrillic alphabet-2' (Rasshirennyj Kirillicheskij
alfavit-2, RKA-2) which includes many obsolete letters. The manual states
that it's very important to code these letters uniformly in a computer, so
the text can be transmitted over computer networks. While I personally
don't expect to need obsolete letters other than the above 3 in my work,
I'm really surprised that the authors of 2112 did not include the RKA-2
letters (including the required pre-accented variants) in their proposal.
(Remark: RKA-2 is somewhat Russian-specific, and does not include a few
obsolete letters used in non-Russian Cyrillic in the past; hopefully, these
too will be present in the TeX typeface.)

Dimitri Vulis, 07/11/90