[comp.std.misc] The long-promised 'comments' on the Cyrillic part of CD 10646

DLV@CUNYVMS1.BITNET (07/12/90)

Ed Hart said:
>I could not locate the inverted "U" accent in the two rows (382) characters
>of Cyrillic in ISO/dp 10646 draft 2.

I have been promising various people to post my comments on CD 10646 for some
weeks, and never got around to it. I'll do so now. Since I use computers quite
a bit to process Russian texts, and sometimes give others free advice (for what
it's worth), I feel that these comments may be of interest to some. I'm cross-
posting this to several lists; please accept my apologies for the inevitable
duplication, especially if you are not interested in Cyrillic text processing.

I am very grateful to Johan van Wingen for telling me about the document
JTC1/SC2 N 2112, Summary of voting on DP 10646.2, Multiple-octet code, and for
helping me obtain attachment 12 (the official comment of the USSR national body
on the required Cyrillic character set), and for all the invaluable help he
provided in the past to me personally, the Russian TeX project, and to the
users' community at large. I also would like to thank David Birnbaum and Wayles
Browne for telling me things about processing Cyrillic texts in languages other
than modern Russian (I readily admit that I know practically nothing firsthand
about non-Russian Cyrillic text processing).

General remarks on 10646: I am unhappy about the absence of floating accents
and the lack of provision for registering italic / cursive variants of glyphs.
I know that handling floating / non-spacing accents is a pain in the neck; it
really is more a matter of religious beliefs. :) I believe in floating accents.

---

Obvious technical remarks about 2112 attachment 12:

032/032/039/112,113 is in the text, but not in the picture; 032/032/039/188,189
and 032/032/039/190,191 are in the picture, but not in the text; and there is a
little mess-up near 213.

Serbo-Croatian DJE and (Abkhazian et al.) 'TE with scythe' are not the same
letter. They look very different! 032/032/039/216,217 should be 'TE with
scythe'; 032/032/039/218,219 should not be 'DJE with acute' but 'TE with scythe
and acute'.

The Khakassian letter 032/032/039/208,209 is not 'CHE with ogonek' but 'CHE
with cedilla on the left'. (Cyrillic cedilla, like the one on the left of
Cyrillic DE.)

---

Letter names do not conform to the ISO rules (I've been told by JvW, and it is
also my impression:). I've sent the list of names of letters in 2112 that we've
come up with for the TeX project to JvW and Vinogradov; I'll be very pleased if
someone finds that paper useful.

I am not sure about proper English names for the following letters and letter
elements:

'Cyrillic cedilla' (the descender on DE, TSE, and SHCHA).

'Scythe'

'Rounded circumflex'

'Turkic Y-like YU'

The other heathen shapes seem to have been christened well.

---

Accented letters:

(Skip the following introduction if you already know it.) In regular Russian
text processing accented letters are seldom used. An acute accent over a vowel
indicates that this vowel is stressed. These accents are sometimes used in
books intended for children or students of the language; it is presumed that
those who know the language will know where the stress lies. HOWEVER, there are
a few cases where it is necessary to indicate the stress to avoid ambiguity.
E.g., ból'shaya chast' means "larger part", while bol'sháya chast' simply
means "large part", and just bol'shaya chast' is ambiguous --- thus 8859-5
is really insufficient to handle such cases. (There are many other examples,
e.g. chto vs. chtó in sentences.)
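
The insufficiency can be sketched in a few lines of Python. This is only an
illustration under assumptions that are not part of the discussion above: the
stress is written as a floating acute (U+0301, COMBINING ACUTE ACCENT) placed
after the vowel, and Python's standard iso8859_5 codec stands in for 8859-5.
The helper name fits_8859_5 is made up for the example.

```python
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT; an assumption, not an 8859-5 character


def fits_8859_5(text: str) -> bool:
    """Return True if every character of text can be encoded in ISO 8859-5."""
    try:
        text.encode("iso8859_5")
        return True
    except UnicodeEncodeError:
        return False


plain = "большая часть"                 # stress not marked: ambiguous
larger = "бо" + ACUTE + "льшая часть"   # stress on 'o': "larger part"
large = "больша" + ACUTE + "я часть"    # stress on 'a': "large part"

for label, text in [("plain", plain), ("larger", larger), ("large", large)]:
    print(label, fits_8859_5(text))

# Dropping the accent to make the text fit collapses both readings into one.
print(larger.replace(ACUTE, "") == large.replace(ACUTE, ""))
```

Only the unmarked phrase survives the encoding; once the accents are removed
to make the text fit, the two readings become the same string.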

The diaeresis over e is used to indicate that the letter is pronounced 'yo',
and indeed 8859-5 has 'yo' as a separate letter. One almost never needs to
combine the diaeresis with an acute, because 'yo' is almost always stressed
--- the only obvious exceptions are words like "tryokhkolyosnyi" (the primary
stress is on the second 'yo' and you need 'e with diaeresis and acute' to
express this) and Hungarian geographical names transliterated in Cyrillic.
In regular texts (i.e., not those intended for students of the language) just
'e' is used, since the reader is expected to know when an 'e' becomes 'yo'.
However, WB tells me that in Belorussian texts the use of \"{e} is mandatory.
Finally, in student texts the grave accent is sometimes used to indicate a
secondary stress in multi-syllable words (e.g., on the first \"{e} in the
example above).

2112 has most, but not all, vowel+acute combinations needed for Russian and
Ukrainian text processing. It lacks 'reverse e' with acute, (Ukrainian) Latin i
with acute and (Ukrainian) Latin i with both diaeresis and acute. 'Reverse e'
with acute is very important, because this letter is often used for
transliterating foreign words and names, and indicating the stress there may be
crucial.

With the addition of 'reverse e with acute' this inventory would be sufficient
for modern Russian. For processing 'student texts' it would be good to provide
complements with graves for all the letters present with acutes (+the 3 above).

Birnbaum and Browne have told me that in Bulgarian the grave accent is
traditionally used to indicate the primary accent, and there are situations
similar to Russian, where the indication of stress is required to avoid
ambiguity. In particular, 'i' is 'and' and 'i with grave' is the feminine
dative singular pronoun. Thus, the addition of vowels with grave seems to be a
requirement for Bulgarian, not just a 'nice to have' addition for Russian.
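
The difference between a pre-accented letter and a letter plus a floating
accent can be probed with a small Python sketch. Note the assumption: Python's
unicodedata module reflects the present-day Unicode character database, not CD
10646 or the 2112 proposal, so the results say nothing about either document.
The helper name has_precomposed and the choice of test letters are made up for
the illustration.

```python
import unicodedata

GRAVE, ACUTE, DIAERESIS = "\u0300", "\u0301", "\u0308"  # floating accents


def has_precomposed(base: str, *marks: str) -> bool:
    """True if base + marks normalizes (NFC) to a single code point."""
    composed = unicodedata.normalize("NFC", base + "".join(marks))
    return len(composed) == 1


candidates = {
    "e with diaeresis ('yo')": ("\u0435", DIAERESIS),
    "i with grave (Bulgarian)": ("\u0438", GRAVE),
    "'reverse e' with acute": ("\u044D", ACUTE),
    "Ukrainian i with diaeresis and acute": ("\u0456", DIAERESIS, ACUTE),
}

for name, (base, *marks) in candidates.items():
    print(f"{name}: precomposed = {has_precomposed(base, *marks)}")
```

Any combination for which this reports False can only be written with a
floating accent in that character database.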

Browne has graciously sent me a paper that mentions, among other things, the
vowel-accent combinations needed for Serbo-Croatian. He said:

>Serbo-Croatian is normally written without accent marks. But some grammar
>books and dictionaries use the marks, and one occasionally uses a mark in
>ordinary texts. Since the language distinguishes long vs. short vowels, and
>also rising vs. falling pitch contour on accented vowels, the number of
>different marks required is large.

('vowels' = Cyrillic a e i o r u (yes, r is a 'vowel' in SC)).

Short falling: 'vowels' with double grave accent
Short rising: 'vowels' with single grave accent
Long falling: 'vowels' with 'rounded circumflex'
Long rising: 'vowels' with acute accent
Long, but not accented: 'vowels' with macron

My understanding is that these diacritical marks are required for 'student
texts' intended to show students of the language how to pronounce
Serbo-Croatian words, but not required for other kinds of text processing. For
TeX, we will most likely support them all simply by providing the appropriate
floating accents (double grave and rounded circumflex).
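
As a rough sketch of the floating-accent approach, the table above can be
written as a mapping from accent type to a combining mark, applied to any of
the six 'vowels'. The Unicode code points are my assumption and are not taken
from 10646 or 2112; in particular, identifying the 'rounded circumflex' with
COMBINING INVERTED BREVE (U+0311) is a guess based on its shape. The names
SC_MARKS, SC_VOWELS and accented are invented for the example.

```python
import unicodedata

# Assumed correspondence between the five Serbo-Croatian marks and
# Unicode combining characters.
SC_MARKS = {
    "short falling": "\u030F",    # COMBINING DOUBLE GRAVE ACCENT
    "short rising": "\u0300",     # COMBINING GRAVE ACCENT
    "long falling": "\u0311",     # COMBINING INVERTED BREVE ('rounded circumflex')
    "long rising": "\u0301",      # COMBINING ACUTE ACCENT
    "long unaccented": "\u0304",  # COMBINING MACRON
}

# Cyrillic a e i o r u ('r' counts as a vowel here).
SC_VOWELS = "\u0430\u0435\u0438\u043E\u0440\u0443"


def accented(vowel: str, kind: str) -> str:
    """Return the vowel followed by the floating accent for the given kind."""
    if vowel not in SC_VOWELS:
        raise ValueError(f"not a Serbo-Croatian 'vowel': {vowel!r}")
    return vowel + SC_MARKS[kind]


all_forms = [accented(v, k) for v in SC_VOWELS for k in SC_MARKS]
print(len(all_forms), "combinations from", len(SC_MARKS), "floating accents")
for kind, mark in SC_MARKS.items():
    print(f"{kind}: {unicodedata.name(mark)}")
```

Five floating accents cover all thirty letter-and-mark combinations, which is
the economy that floating accents offer over pre-accented letters.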

---

Character inventory for the Cyrillic-based writing systems of the assorted
Soviet minorities other than Belorussians and Ukrainians (i.e., Abkhazian,
Azeri, ... Yakut):

For the RusTeX project at some point we collected the list of the letters
needed for processing text in these languages (that is, additional letters not
in ISO8859-5 and equivalents, used to denote special sounds in those languages;
e.g., {\cyr G} with horizontal stroke for representing Turkic 'gh' sound, the
Y-like letter representing the \"{u} sound in several Turkic languages, etc).
We used Gilyarevsky and Grivnin's book (recommended by JvW) and a few other
useful references.

In the DP 10646.2 that I received, the Cyrillic part had many strange and weird
glyphs which I could not identify after consulting G&G and other sources. (I'm
curious how this list of glyphs was composed in the first place!)

I was pleased to see that the Soviet part of 2112 was very similar to what we
had come up with (this may simply indicate that we used the same sources for
our research:). There are, however, 3 letters of this class present in our
proposed TeX typeface inventory and not present in 2112:

1 Ukrainian hard G
2 Uzbek KHA with ogonek (Q)
3 Uzbek KA with ogonek (H)

(Note: it is permissible to replace KA with ogonek and KHA with ogonek by their
versions with 'Cyrillic cedilla', and indeed this substitution can be found in
many Uzbek texts printed outside of Uzbekistan.)

In addition, 'KHA with breve' and 'Turkic Y-like yu with diaeresis' may be
required.

---

Early (Obsolete) letters

For my work I needed the 3 letters that were dropped from the Russian alphabet
in 1918: yat', fita, and izhitsa. (The 4th character of this group, 'Latin i',
is present because it's used in Ukrainian.) I should explain that I am an
aspiring cryptologist and I have become concerned with certain aspects of
lexicography because of that. I once wrote a little PD hack for adding an ISO
8859-5 code page to MS-DOS computers, which is, it seems, widely used; in that
hack, I added the letters yat', fita, and izhitsa, as well as the guillemets
(French quotes) in arbitrary positions in columns 8 and 9, making my files
somewhat non-standard. :)
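
To show why such files are non-standard, here is a minimal Python sketch of
that kind of extension. The byte positions in it are purely hypothetical (the
actual slots used in the hack are not given here), and the Unicode code points
and the names EXTRA and decode_extended are assumptions made only for the
illustration.

```python
# Hypothetical assignments in columns 8 and 9 (bytes 0x80-0x9F).
# These positions are invented for the example.
EXTRA = {
    0x80: "\u0462", 0x81: "\u0463",  # capital / small yat'
    0x82: "\u0472", 0x83: "\u0473",  # capital / small fita
    0x84: "\u0474", 0x85: "\u0475",  # capital / small izhitsa
    0x90: "\u00AB", 0x91: "\u00BB",  # left / right guillemets
}


def decode_extended(data: bytes) -> str:
    """Decode 8859-5 bytes, honouring the extra letters in columns 8 and 9."""
    return "".join(
        EXTRA.get(b) or bytes([b]).decode("iso8859_5") for b in data
    )


sample = bytes([0x90, 0x81, 0x91])
print(decode_extended(sample))            # guillemets around a small yat'
print(repr(sample.decode("iso8859_5")))   # a standard decoder sees controls
```

A standard 8859-5 decoder treats the same bytes as control characters, so the
extra letters are lost unless both ends agree on the private table.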

The second draft of DP 10646 said:

>C.10 Bibliographic characters
>
>Bibliographic applications, and those of librarians and linguists require
>obsolete letters (especially for Cyrillic with pre-reformed orthography).

Indeed, it included the yat' character, but not the other two (much less common
than yat'). I was really disappointed to see that yat' has been deleted in
2112, and the other 2 letters were not added.

There are other, more obscure, obsolete Cyrillic letters required for
bibliographic and linguistic work. For example, the 'nasal' letters (Cyrillic
equivalents of Polish a and e with ogonek) were used in Bulgaria until 1945.

The manual: V.M.Andryushchenko, 'Kontseptsiya i arkhitektura mashinnogo fonda
russkogo yazyka', Moscow, Nauka, 1989 has on p. 187 an 'expanded Cyrillic
alphabet-2' (Rasshirennyj Kirillicheskij alfavit-2, RKA-2) which includes many
obsolete letters. The manual states that it's very important to code these
letters uniformly in a computer, so the text can be transmitted over computer
networks. While I personally don't expect to need obsolete letters other than
the above 3 in my work, I'm really surprised that the authors of 2112 did not
include the RKA-2 letters (including the required pre-accented variants) in
their proposal.

(Remark: RKA-2 is somewhat Russian-specific, and does not include a few
obsolete letters used in non-Russian Cyrillic in the past; hopefully, these
too will be present in the TeX typeface.)

Dimitri Vulis, 07/11/90