enag@ifi.uio.no (Erik Naggum) (04/26/91)
Gentlemen, I've become somewhat tired of reading comments of the "my character set standard has more characters than your character set standard" kind. The problem is not one of characters already included in any given character set standard at its draft stage, but of how easily new characters can be added when needed, and how you address them. To paraphrase a saying from programming environments: there's always one more character.

Unicode has the charming quality that each script is separated in the code table by a generous amount of unassigned character positions. There is also what the Unicode Consortium believes to be a generous amount of spare code points for other scripts. ISO DIS 10646 does not have this charming quality to the same extent, being much more tightly packed, but it has entire rows available for new scripts, depending on their size. If you don't like any of these, you can grab a private use row. There are entire planes available for special scripts with lots of characters (ideographic scripts). Private use planes also exist.

The ability to subsume an industry standard such as Unicode into ISO DIS 10646 is eminently present. Indeed, ISO DIS 10646 can subsume anything. When or if we meet life in outer space, they'd probably appreciate one of the 190 remaining groups, too.

Unicode has the charming quality that you can address any of the 65 536 possible characters with a constant sixteen bits. (I'm deliberately glossing over the "what's a character, anyway" issue.) ISO DIS 10646 has mechanisms to address any of the 1 330 863 361 possible characters, but each with a varying number of bits, unless you use the four-octet canonical form.

Unicode is stateless in terms of what any given 16-bit binary value means. (Again, glossing over issues such as floating diacritics.) ISO DIS 10646 has numerous states due to the compaction methods, the Single Graphic Character Introducer, and the High Octet Preset mechanisms.

Unicode works with a single unit 16 bits wide.
ISO DIS 10646 works with several units 8 bits wide. Unicode is subject to endianism. ISO DIS 10646 is octet-stream based, and is not subject to endianism.

These are technical differences which will have a much larger impact on the acceptance of each of these proposed standards than any number of characters included in or excluded from each. There are a couple of important aspects of each of these that also require attention, and comparisons with previous attempts at the same have generally not fared well:

Unicode employs floating diacritics for scripts which do not separate the diacritic from the character to which it applies. This was tried out in ISO 6937/2, a standard which is used mainly for reference purposes and in some specific applications for which it was created.

ISO DIS 10646 employs code shifting in various ways, analogous to ISO 2022, ISO 4873 (num?) and others. This has generally posed problems for programmers who would like a one-to-one relationship between character and bit-string.

Unicode caters to programmers with its fixed width, and to typographic and bibliographic needs with floating diacritics, but these two concerns tend to be contradictory on several levels. ISO DIS 10646 caters to national and international standards, and their procedures, which will ensure that formal agreement on a good standard will be easier and that revisions will be few and far between (in time). (This "good" may not map to your "good", and I'm not going to fight over that.)

These issues are relevant to the questions of agreement and acceptance by industry and systems developers, and become especially delicate when we consider government requirements. Governments tend to choose International Standards over industry standards (partly because appearing to give particular vendors an advantage places them in an uncomfortable light), and the European Community politicians are getting more and more power over what is and is not going to be part of Europe as we have yet to know it.
I'd like to see some discussion on these topics, instead of the useless quibbling over which character set does or does not have "FOOTWEAR CAPITAL LETTER SWOOSH WITH AIR BELOW" or any other favorite "required" character.
--
[Erik Naggum] <enag@ifi.uio.no>
Naggum Software, Oslo, Norway <erik@naggum.uu.no>
kkim@plains.NoDak.edu (kyongsok kim) (04/27/91)
(Erik Naggum) writes:
:Unicode is subject to endianism.
could anyone please explain what "endianism" is?
:Unicode employs floating diacritics for scripts which do not separate
:the diacritic and the character to which it applies.
most people in favor of iso 10646 attack floating diacritics. how do
floating diacritics and non-spacing characters (which i believe iso 10646
adopts) differ? from end-users' point of view, these two seem one and
the same. am i missing something?
:[Erik Naggum] <enag@ifi.uio.no>
:Naggum Software, Oslo, Norway <erik@naggum.uu.no>
k kim
enag@ifi.uio.no (Erik Naggum) (05/04/91)
In article <10003@plains.NoDak.edu> kkim@plains.NoDak.edu (kyongsok kim) writes:

(Erik Naggum) writes:
:Unicode is subject to endianism.

could anyone please explain what "endianism" is?

Sorry for using an unwarrantedly technical term outside its original domain. Computers whose smallest addressable unit of information is the octet (byte) need some ordering scheme for the octets that make up units consisting of more than one octet, such as a 16-bit quantity or a 32-bit quantity. There are basically two ways to do this, with variations on the theme, called "big-endian" and "little-endian". Big-endian octet order means that the "big end" (most significant octet) comes first, and conversely for little-endian octet order.

By way of example, consider the octet order for the 16-bit quantity U+0040 (the commercial at-sign in Unicode). Big-endian hardware would represent this as

   +----+----+
   | 00 | 40 |
   +----+----+

(reading memory from low addresses at left to high addresses at right), while little-endian hardware would represent the same numeric quantity or Unicode character as

   +----+----+
   | 40 | 00 |
   +----+----+

What I mean by "endianism", then, is the whole issue around the portability of binary-coded information when larger-than-octet units are moved around one octet at a time. E.g., if a little-endian machine writes a U+0040 to a file, it will be read as whatever U+4000 is in Unicode on a big-endian machine, and exactly the same the other way around. It should be clear that interoperability will suffer significantly under this scheme, and whichever choice is made, machines that have made the other choice will take a severe performance penalty.

:Unicode employs floating diacritics for scripts which do not separate
:the diacritic and the character to which it applies.

most people in favor of iso 10646 attack floating diacritics. how do
floating diacritics and non-spacing characters (which i believe iso 10646
adopts) differ?
from end-users' point of view, these two seem one and the same. am i
missing something?

Consider the Norwegian and French words for a small restaurant, spelled "cafe'" (where the ' serves as a floating acute accent for rendering purposes in the absence of an international character set standard in which we wouldn't need it :-). In Norwegian, the acute accent over e is optional; it's an ornament to indicate stress, toneme, etc. It's not orthographically required. In French, an e with acute is a different orthographic unit from a plain, unadorned e.

This means that in Norwegian, we can make do with a floating acute accent, since the function of the acute accent is to modify the character with which it is combined. In French, however, they cannot make do with a floating acute accent, because the acute accent does not have a function by itself. Rather, the unit is "e with acute".

Then there's the Norwegian character "a with ring above", in which the ring above has exactly the same nature as the acute accent in French. If Norwegian were supposed to be written with "a*" (* substituting for a non-existent non-spacing floating "ring above"), it would complicate things for us to the point where we would have to vote a strong NO to a standard forcing us to do this. (Note that we can't vote against Unicode; we can only "fail to adopt it".)

Of course, French and Norwegian are sufficiently important languages that we've had all our characters represented in ISO 8859-1 (with the possible exception of the French political faux pas with respect to the "oe" ligature). Some minority languages are less well off, to put it mildly. I've heard that East European languages employing a heavily diacriticized Cyrillic script are suffering from the lack of characters for their needs, and think that floating diacritics are the answer to their problem.

So, to summarize, a diacritic mark may or may not be an integral part of a character, depending on orthographic conventions in the language in question.
To treat a diacritic as floating when it is an integral part of a character would be wrong, as would insisting on having all possible combinations of a truly floating diacritic and the characters with which it may be combined coded separately.

Now, ISO DIS 10646 is of the "insist on all combinations" persuasion, but has non-spacing characters for languages in which the "separate unit of information" is eminently the case (e.g. Hebrew). I've come to learn that this is overly restrictive in many, many cases. Unicode allows a large number of floating diacritical marks in languages on which I don't have a shred of competence to comment, but several people have expressed the opinion that they're not really floating for several languages.

Without a firm ruling in the standard or national standards on the nature of the diacritical marks under the orthographic conventions employed, there is an annoying ambiguity between "cafe'" and "caf*" (* now substituting for e with acute accent). Is the * really an e plus an ', or is it a separate character, or vice versa? As noted above, the answer is different for French and Norwegian, although the word is exactly the same!

The other problem with floating diacritics is that the number of characters is not naturally bounded, a thought at which ISO understandably shudders. Unicode talks about bounding the displayable number of characters (with diacritical marks) through extra-standard means, while ISO wants to do it with intra-standard means. For instance, a commercial at-sign with acute accent and cedilla below doesn't make much sense. What should a Unicode display device do with that sequence of characters?

I am deeply indebted to Professor David Birnbaum for explaining this to me in much detail, and I'm of course responsible for any mistakes. Hope this has helped.
--
[Erik Naggum]      Professional Programmer    <enag@ifi.uio.no>
Naggum Software    Electronic Text            <erik@naggum.uu.no>
0118 OSLO, NORWAY  Computer Communications    +47-2-836-863
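The ambiguity Erik describes between "cafe'" and "caf*" can be illustrated with present-day Python and its unicodedata module; note that the normalization machinery and the specific code points here post-date this thread and stand in for what the 1991 drafts were debating:

```python
import unicodedata

# The same word, spelled two ways: one precomposed "e with acute",
# or a plain "e" followed by a floating (combining) acute accent.
precomposed = "caf\u00E9"    # café with a single code point for e-acute
decomposed = "cafe\u0301"    # cafe + combining acute accent

# The two spellings render identically but compare unequal as raw strings.
assert precomposed != decomposed

# Normalization resolves the ambiguity: NFC composes, NFD decomposes.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

In effect, the eventual merged standard answered the "is * an e plus an ', or a separate character?" question with "both, declared canonically equivalent".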
hpa@casbah.acns.nwu.edu (H. Peter Anvin) (05/04/91)
In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>Unicode allows a large number of floating diacritical marks in
>languages which I don't have a shred of competence to make comments,
>but several people have expressed the opinion that they're not really
>floating for several languages.

Yes, UNICODE does not care which language we are dealing with; note too that one may have to combine characters from several sections of the UNICODE in order to form a complete script. The question then becomes: so what? If we insist on having diacritics that float for the languages that allow them and are fixed for the languages that require them, someday someone will type "e'" with a fixed diacritic while writing Norwegian, or a floating one in French, just to have something break for them.

As I understand it, UNICODE only has non-floating diacritics for historical (compatibility) reasons. For example, "e'" is U+00E9 only for compatibility with Latin-1, while the explicit coding is U+0065 U+0301. I take it that at U+00E9 there will just be an alias entry referring to U+0065 U+0301.

>The other problem with floating diacritics is that the number of
>characters is not naturally bounded, a thought at which ISO
>understandably shudders. Unicode talks about bounding the displayable
>number of characters (with diacritical marks) through extra-standard
>means, while ISO wants to do it with intra-standard means. For instance,
>a commercial at-sign with acute accent and cedilla below doesn't make
>much sense. What should a Unicode display device do with that
>sequence of characters?

In my opinion, it should take the @ sign, superimpose an acute accent, and tack a cedilla on at the bottom. A high-quality output device will probably have a set of pre-finished combinations, but that doesn't prevent it from using plain old superposition (or fancied-up superposition) as a default solution. After all, the combination tells it what it should look like, right?
Endianism is a tricky question, but in most cases there is precedent. For telecommunication, both CCITT and Internet standards advocate big-endianism (Motorola style). Check out what the sequence of bits out of a V.24/RS-232 port is. Big-endian. Thus that is probably the preferred style for interchange. For word processors etc. there are usually numeric fields which have had to be resolved, mostly in the style dominant on the machine the product was introduced on. [P.S. As a programmer, I prefer little-endian (Intel) style; while a big-endian hex dump is easier to read, little-endianism avoids many of the problems with different variable sizes.]

I also think there should be a recommended mangling scheme for converting Unitext to the ASCII text spectrum (NOT octet spectrum) for purposes like Internet mail, which is not very likely to change any time soon. I have given the question some thought, but I am not going to say anything until I have figured out a "safe" way that could also distinguish between Unitext and ASCII text.

/Peter
--
IDENTITY: Anvin, H. Peter            STATUS: Student
INTERNET: hpa@casbah.acns.nwu.edu    FIDONET: 1:115/989.4
HAM RADIO: N9ITP, SM4TKN             RBBSNET: 8:970/101.4
EDITOR OF: The Stillwaters BBS List  TEACHING: Swedish
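The interchange hazard discussed in this subthread is easy to demonstrate with modern Python's struct module ('>' means big-endian, '<' little-endian); this is an illustration of the general mechanism, not anything specified by either 1991 draft:

```python
import struct

AT_SIGN = 0x0040  # U+0040, the commercial at-sign in Unicode

# The same 16-bit quantity serialized in the two octet orders.
big = struct.pack(">H", AT_SIGN)      # most significant octet first
little = struct.pack("<H", AT_SIGN)   # least significant octet first
assert big == b"\x00\x40"
assert little == b"\x40\x00"

# A big-endian reader handed little-endian octets sees a different
# character entirely: U+4000 instead of U+0040.
(misread,) = struct.unpack(">H", little)
assert misread == 0x4000
```

This is exactly why wire formats pick one canonical octet order up front: the bytes themselves carry no hint of which order the writer used.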
ck@voa3.VOA.GOV (Chris Kern) (05/05/91)
In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>Consider the Norwegian and French words for a small restaurant,
>spelled "cafe'" (where the ' serves as a floating acute accent for
>rendering purposes in the absence of an international character set
>standard in which we wouldn't need it :-). In Norwegian, the acute
>accent over e is optional, it's an ornament to indicate stress,
>toneme, etc. It's not orthographically required. In French, an e
>with acute is a different orthographic unit than plain, unadorned e.
>
>This means that in Norwegian, we can make do with a floating acute
>accent, since the function of the acute accent is to modify the
>character with which is combined. In French, however, they cannot
>make do with a floating acute accent because the acute accent does not
>have a function by itself. Rather, the unit is "e with acute".

I confess that I don't understand the problem. Regardless of the attributes of the underlying language, is there some reason why I should care whether a character-diacritic combination is stored as one code or two, as long as (a) its image is properly rendered when I need to look at it and (b) a program which consumes a text stream that includes such (character-diacritic) combinations can unambiguously determine its content? (Of course, if I can meet requirement "b", presumably I can meet requirement "a" as well.)
--
Chris Kern                         ck@voa3.voa.gov
...uunet!voa3!ck                   +1 202-619-2020
djbpitt@unix.cis.pitt.edu (David J Birnbaum) (05/05/91)
In article <1991May4.180549.29162@voa3.VOA.GOV> ck@voa3.VOA.GOV (Chris Kern) writes:
>I confess that I don't understand the problem. Regardless of the
>attributes of the underlying language, is there some reason why I
>should care whether a character-diacritic combination is stored as
>one code or two as long as (a) its image is properly rendered when
>I need to look at it and (b) a program which consumes a text stream
>that includes such (character-diacritic) combinations can
>unambiguously determine its content?

Yes and no. One could encode English logographically, but we don't do it because (among other things) people don't process English text logographically; they do it by character. Similarly, we could encode Hebrew consonant plus vertically aligned vowel points and cantillation marks as single characters, but people don't work with Hebrew text this way.

One practical consequence of encoding vowel+accentual_diacritic one way or the other is how it affects natural classes. I can search for all words with long rising accents in Serbocroatian (graphically an acute) more easily if the acute is a separate character. I would not want to conduct such a search in French, where letters with acute do not constitute a natural class (i.e., where "acute" does not have an independent meaning), but this is not an unnatural type of search to make in Serbocroatian; it is comparable to searching for all words containing any other letter. As another example, I can strip the (orthographically optional) accents from a Serbocroatian text more efficiently by searching for and deleting the five accentual diacritics than by searching for and replacing each accented vowel with its unaccented counterpart. Again, this is not something one would normally want to do for French.

One other issue is efficient use of character cells. If there is a small number of vowels and a small number of accent marks (to use a common example and imprecise terminology), there isn't a lot at stake.
But take a system with lots of vowel letters, lots of accent marks, and the possibility of multiple accent marks on a single vowel, and you're talking about a lot of character cells if each combination is to be treated as an indivisible unit. And it is writing systems where accent marks are productive units, combined ad hoc with a natural class of letters (such as vowels), that have this large number of combinations.

At a certain level, the answer to your question is that it doesn't matter. This seems to be why 10646 and Unicode have been able to take opposing positions on the issue; both are concerned with form, rather than function, and anything that arrives at the correct form fulfills the minimal requirements. But there are plenty of writing systems that aren't like English or like French, where you can only support the full inventory of complex combinations either by storing each combination as a sequence or by dedicating an extremely large number of character cells. For orthographies like this, the former is more efficient and corresponds more directly to the types of operations that users may want to perform on the text.

--David
=======================================================================
Professor David J. Birnbaum       djbpitt@vms.cis.pitt.edu  [Internet]
The Royal York Apartments, #802   djbpitt@pittvms.bitnet    [Bitnet]
3955 Bigelow Boulevard            voice: 1-412-687-4653
Pittsburgh, PA 15123 USA          fax:   1-412-624-9714
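The accent-stripping operation described above becomes a few lines of code when diacritics are stored as separate characters; a sketch in modern Python, using today's decomposition machinery (the sample word is illustrative, not drawn from the thread):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose precomposed letters, then drop combining marks
    (Unicode general category Mn, 'Mark, nonspacing')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# An accented vowel reduces to its base letter once the floating
# diacritic is a deletable character of its own.
assert strip_accents("gr\u00E0d") == "grad"
```

With precomposed-only coding, the same task requires a replacement table covering every accented letter; with separate diacritics, it is a filter over one small class of characters, which is exactly Birnbaum's efficiency point.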
ccplumb@rose.waterloo.edu (Colin Plumb) (05/05/91)
I'm not a great linguist (English, French, and German), but I also like separate accents because it's so much easier to accommodate weird uses. Mathematicians put funny accents over and under every letter in creation. Ever played with rho-hat? Linguists and phoneticians may do the same. And it's such a bother enumerating all the legal possibilities. There's a CCITT standard, which I can't seem to locate right now, that uses non-spacing accents, and it seems like the right thing to me.

Yes, e-acute is conceptually one thing in French, but qu and ph are pretty distinct entities in English, and I can't say for sure how different o-umlaut is from oe in German. Mc and Mac have been special-cased in many places in English (the correct all-caps spelling of McDonald's is McDONALD'S), with superscript c's being common. It's pretty much impossible to come up with a character standard that only lets you do sensible things. All I can suggest is: don't do the senseless ones. Treat accented characters as double-byte characters (recognizable by the first byte) if the accents are inseparable, but don't if they can be logically separated.

The CCITT standard also specifies a subset of the possible combinations that are required to be displayable. The usual cheap implementation is probably accents plus some sort of character-height information, while a higher-rent scheme uses some dedicated pairs, with fallback to the former. Separate accents make the low-cost scheme much easier, without seriously hampering the higher-cost one. Good typesetting systems already handle ligatures and kerning as it is.
--
	-Colin
kkim@plains.NoDak.edu (kyongsok kim) (05/05/91)
In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
:Now, ISO DIS 10646 is of the "insist on all combinations" persuasion,
^^^^^^^^^^^^^^^^
Is it explicitly specified in any document, or is it just an implicitly
accepted principle?
:but has non-spacing characters for languages in which the "separate
:unit of information" is eminently the case (e.g. Hebrew). I've come
:to learn that this is overly restrictive in many, many cases.
^^^^^^^^^^^
Could you please explain this in more detail? I am a little bit
confused. Do you mean that "insisting on all combinations" is too
restrictive and therefore somewhat unreasonable for some languages?
Or something else?
--------------------------------------------
I will give one example showing that the "all combinations" principle is
not applicable in at least one case. In the case of Ancient Hangul, nobody
knows
exactly what combinations of characters (i.e., syllables) were used in the
past, although component letters (or characters) of the syllables are
completely known. Every time a scholar finds a new syllable, does he/she
have to report it to a national standards body, which will again report
it to ISO? The scholar may not be able to represent and send that
character until the national standards body modifies its standard and
then ISO modifies 10646. How long will it take? If ISO simply drops the
"all combinations" principle (as with Hebrew), the whole problem can be
solved immediately. (The solution is already known!)
I am still wondering whether Hebrew is the only script in 10646 not
honoring ISO's "all combinations" principle. I tried to figure it out,
but no luck yet. There seem to be several scripts, such as Devanagari and
other scripts used in India, Arabic and its several variants, Thai,
Lao, etc., which have similar properties.
:[Erik Naggum] Professional Programmer <enag@ifi.uio.no>
:Naggum Software Electronic Text <erik@naggum.uu.no>
:0118 OSLO, NORWAY Computer Communications +47-2-836-863
Kyongsok Kim
Dept. of Comp. Sci., North Dakota State University
e-mail: kkim@plains.nodak.edu; kkim@plains.bitnet; ...!uunet!plains!kkim
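As it happens, the direction Unicode ultimately took for modern Hangul is algorithmic: a syllable is computed arithmetically from its component jamo, so no per-syllable registration is needed. A sketch of that (later) formula in modern Python, offered as an illustration of the "drop the all-combinations principle" approach Kim argues for; Ancient Hangul with archaic jamo is handled by sequences instead, outside this formula:

```python
# Modern Unicode composes a precomposed Hangul syllable from the indices
# of its leading consonant (L), vowel (V), and optional trailing
# consonant (T) jamo, starting at the syllable base U+AC00.
S_BASE = 0xAC00   # first precomposed syllable, "ga"
V_COUNT = 21      # number of vowel jamo
T_COUNT = 28      # number of trailing-jamo slots (including "none")

def compose_syllable(l: int, v: int, t: int = 0) -> str:
    """Return the precomposed syllable for jamo indices (l, v, t)."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# L=0 (kiyeok) + V=0 (a) with no trailing jamo yields U+AC00.
assert compose_syllable(0, 0) == "\uAC00"
```

The point for the thread: once composition is a formula (or a sequence of jamo), a scholar's newly attested syllable needs no round trip through a standards body.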
peter@ficc.ferranti.com (Peter da Silva) (05/07/91)
In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
> a commercial at-sign with acute accent and cedilla below doesn't make
> much sense. What should a Unicode display device do with that
> sequence of characters?

                        '
It better display it as @ (more or less), because someone's gonna use it.
                        ,

If you don't believe that, then consider the use of "!#%^&*|" in C.
--
Peter da Silva.  `-_-'  peter@ferranti.com  +1 713 274 5180.
                 'U`    "Have you hugged your wolf today?"