[comp.std.internat] Unicode vs ISO DIS 10646

enag@ifi.uio.no (Erik Naggum) (04/26/91)

Gentlemen,

I've become somewhat tired of reading comments of the "my character
set standard has more characters than your character set standard"
kind.

The problem is not how many characters are already included in any
given character set standard at its draft stage, but how easily new
characters can be added when needed, and how you address them.  To
paraphrase a saying from programming environments:

	There's always one more character.

Unicode has the charming quality that each script is separated in the
code table by a generous amount of unassigned character positions.
There is also what the Unicode Consortium believes to be a generous
amount of spare code points for other scripts.

ISO DIS 10646 does not have this charming quality to the same extent,
being much more densely packed, but it has entire rows available for
new scripts, depending on their size.  If you don't like any of these, you
can grab a private use row.  There are entire planes available for
special scripts with lots of characters (ideographic scripts).
Private use planes also exist.  The ability to subsume an industry
standard such as Unicode into ISO DIS 10646 is eminently present.
Indeed, ISO DIS 10646 can subsume anything.  When or if we meet life
in outer space, they'd probably appreciate one of the 190 remaining
groups, too.

Unicode has the charming quality that you can address any of the
65 536 possible characters with a constant sixteen bits.  (I'm
deliberately glossing over the "what's a character, anyway" issue.)

ISO DIS 10646 has mechanisms to address any of the 1 330 863 361
possible characters, but each with a varying number of bits, if you
don't use the four-octet canonical form.
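
To make the canonical form concrete, here is a minimal C sketch of
picking a code position apart, assuming (as I read the draft) that the
canonical four-octet form is simply the group, plane, row and cell
octets in that order; the struct and helper names are mine, not the
DIS's:

    /* Hypothetical decomposition of a DIS 10646 canonical (four-octet)
     * code position into its group, plane, row and cell octets. */
    struct dis10646_pos {
        unsigned char group, plane, row, cell;
    };

    static struct dis10646_pos split_pos(unsigned long code)
    {
        struct dis10646_pos p;

        p.group = (unsigned char) ((code >> 24) & 0xFF);
        p.plane = (unsigned char) ((code >> 16) & 0xFF);
        p.row   = (unsigned char) ((code >>  8) & 0xFF);
        p.cell  = (unsigned char) ( code        & 0xFF);
        return p;
    }

The compacted forms drop one or more of the leading octets, which, as
I understand it, is where the varying number of bits per character
comes from.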

Unicode is stateless in terms of what any given 16-bit binary value
means.  (Again, glossing over issues such as floating diacritics.)

ISO DIS 10646 has numerous states due to the compaction methods,
Single Graphic Character Introducer, and High Octet Preset mechanisms.

Unicode works with a unit 16 bits wide.

ISO DIS 10646 works with several units 8 bits wide.

Unicode is subject to endianism.

ISO DIS 10646 is octet stream based, and is not subject to endianism.

These are technical differences which will have a much larger impact
on the acceptance of each of these proposed standards than any number
of characters included in or excluded from each.

There are a couple of important aspects of each of these that also
require attention, and where comparisons with previous attempts at the
same thing have generally not fared well:

Unicode employs floating diacritics for scripts which do not separate
the diacritic and the character to which it applies.  This was tried
out with ISO 6937/2, a standard which is used mainly for reference
purposes and in some specific applications for which it was created.

ISO DIS 10646 employs code shifting in various ways, analogous to ISO
2022, ISO 4873 (num?) and others.  This has generally posed problems
for programmers who would like a one-to-one relationship between
character and bit-string.
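
To see why, consider what even plain character counting looks like
once shift states enter the picture.  The SI/SO convention in this
little C sketch is borrowed loosely from ISO 2022 for illustration
only; it is not the actual DIS 10646 compaction mechanism:

    #include <stddef.h>

    #define SHIFT_OUT 0x0E  /* switch to a two-octet set (illustrative) */
    #define SHIFT_IN  0x0F  /* switch back to the one-octet set */

    /* Count characters in a stateful octet stream: what an octet means
     * depends on which shift state the decoder is currently in. */
    size_t count_chars(const unsigned char *s, size_t len)
    {
        size_t i = 0, n = 0;
        int two_octet_mode = 0;

        while (i < len) {
            if (s[i] == SHIFT_OUT) {
                two_octet_mode = 1;
                i++;
            } else if (s[i] == SHIFT_IN) {
                two_octet_mode = 0;
                i++;
            } else {
                n++;
                i += two_octet_mode ? 2 : 1;
            }
        }
        return n;
    }

With a fixed-width coding the same count is the buffer length divided
by the unit size, and you can start decoding in the middle of a buffer
without scanning from the beginning to recover the state.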

Unicode caters to programmers with its fixed width, and to typographers'
and bibliographers' needs with floating diacritics, but these two aims
tend to be contradictory on several levels.

ISO DIS 10646 caters to national and international standards, and
their procedures, which will ensure that formal agreement on a good
standard will be easier and that revisions will be few and far (in
time) between.  (This "good" may not map to your "good", and I'm not
going to fight over that.)

These issues are relevant to the questions of agreement and acceptance
by industry and systems developers, and they become especially delicate
when we consider government requirements.  Governments tend to choose
International Standards over industry standards (partly because
appearing to give particular vendors an advantage places them in an
uncomfortable light), and the European Community politicians are
getting more and more power over what is and is not going to be part
of Europe as we have yet to know it.

I'd like to see some discussion on these topics, instead of the
useless quibbling over which character set does or does not have
"FOOTWEAR CAPITAL LETTER SWOOSH WITH AIR BELOW" or any other favorite
"required" character.

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

kkim@plains.NoDak.edu (kyongsok kim) (04/27/91)

(Erik Naggum) writes:

:Unicode is subject to endianism.

could anyone please explain what "endianism" is?

:Unicode employs floating diacritics for scripts which do not separate
:the diacritic and the character to which it applies.

most people in favor of iso 10646 attack floating diacritics.  how do
floating diacritics and non-spacing characters (which i believe iso 10646
adopts) differ?  from end-users' point of view, these two seem one and
the same.  am i missing something?

:[Erik Naggum]					     <enag@ifi.uio.no>
:Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>

k kim

enag@ifi.uio.no (Erik Naggum) (05/04/91)

In article <10003@plains.NoDak.edu> kkim@plains.NoDak.edu (kyongsok kim) writes:

   (Erik Naggum) writes:

   :Unicode is subject to endianism.

   could anyone please explain what "endianism" is?

Sorry for using an unwarrantedly technical term outside its original
domain.  Computers whose smallest addressable unit of information is
the octet (byte) need some ordering scheme for the octets to make up
units consisting of more than one octet, such as a 16-bit quantity, or
a 32-bit quantity.  There are basically two ways to do this, with
variations over the theme, called "big-endian" and "little-endian".

Big-endian octet order means that the "big end" (most significant
octet) comes first, and conversely for little-endian octet order.  By
way of example, consider the octet order for the 16-bit quantity
U+0040 (the commercial at-sign in Unicode).  A big-endian hardware
would represent this as

	+----+----+
	| 00 | 40 |
	+----+----+

(reading memory from low addresses at left to high addresses at
right), while a little-endian hardware would represent the same
numeric quantity or Unicode character as

	+----+----+
	| 40 | 00 |
	+----+----+

What I mean by "endianism", then, is the whole issue around the
portability of binary coded information when larger-than-octet units
are moved around one octet at a time.  E.g. if a little-endian machine
writes a U+0040 to a file, it will be read as whatever U+4000 is in
Unicode on a big-endian machine, and exactly the same the other way
around.  It should be clear that interoperability will suffer
significantly through this scheme, and whichever choice is made,
machines that have made the other choice will incur a severe
performance penalty.
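
In code, the portable cure is to pick one octet order for interchange
and assemble the 16-bit value explicitly instead of dumping memory
images; a minimal C sketch (the function names are mine):

    #include <stdio.h>

    /* Write and read a 16-bit value in big-endian octet order,
     * regardless of what the host hardware prefers internally. */
    void put_u16(FILE *f, unsigned int c)
    {
        putc((int) ((c >> 8) & 0xFF), f);   /* "big end" first */
        putc((int) (c & 0xFF), f);
    }

    long get_u16(FILE *f)
    {
        int hi = getc(f);
        int lo = getc(f);

        if (hi == EOF || lo == EOF)
            return -1L;
        return ((long) hi << 8) | (long) lo;
    }

A machine that instead fwrite()s its in-memory 16-bit units straight
to the file will hand U+0040 to one kind of hardware and U+4000 to the
other, which is exactly the problem described above.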

   :Unicode employs floating diacritics for scripts which do not separate
   :the diacritic and the character to which it applies.

   most people in favor of iso 10646 attack floating diacritics.  how
   do floating diacritics and non-spacing characters (which i believe
   iso 10646 adopts) differ?  from end-users' point of view, these two
   seem one and the same.  am i missing something?

Consider the Norwegian and French words for a small restaurant,
spelled "cafe'" (where the ' serves as a floating acute accent for
rendering purposes in the absence of an international character set
standard in which we wouldn't need it :-).  In Norwegian, the acute
accent over e is optional; it's an ornament to indicate stress,
toneme, etc.  It's not orthographically required.  In French, an e
with acute is a different orthographic unit from plain, unadorned e.

This means that in Norwegian, we can make do with a floating acute
accent, since the function of the acute accent is to modify the
character with which it is combined.  In French, however, they cannot
make do with a floating acute accent because the acute accent does not
have a function by itself.  Rather, the unit is "e with acute".

Then there's the Norwegian character "a with ring above", in which the
ring above has exactly the same nature as the acute accent in French.
If Norwegian were supposed to be written with "a*" (* substituting for
a non-existent non-spacing floating "ring above"), it would complicate
things for us to the point where we would have to vote a strong NO to
a standard forcing us to do this.  (Note that we can't vote against
Unicode; we can only "fail to adopt it".)

Of course, French and Norwegian are sufficiently important languages
that we've had all our characters represented in ISO 8859-1 (with the
possible exception of the French political faux pas with respect to
the "oe" ligature).  Some minority languages are less well off, to put
it mildly.  I've heard that East European languages employing a
heavily diacriticized Cyrillic script are suffering from the lack of
characters for their needs, and think that floating diacritics is the
answer to their problem.

So, to summarize, a diacritic mark may or may not be an integral part
of a character depending on orthographic conventions in the language
in question.  To treat a diacritic as floating when it is an integral
part of a character would be wrong, as would insisting on having all
possible combinations of a truly floating diacritic and the characters
with which it may be combined coded separately.

Now, ISO DIS 10646 is of the "insist on all combinations" persuasion,
but has non-spacing characters for languages in which the "separate
unit of information" is eminently the case (e.g. Hebrew).  I've come
to learn that this is overly restrictive in many, many cases.

Unicode allows a large number of floating diacritical marks in
languages on which I don't have a shred of competence to comment,
but several people have expressed the opinion that they're not really
floating for several of those languages.

Without a firm ruling in the standard or national standards on the
nature of the diacritical marks arising from the orthographic
conventions employed, there is an annoying ambiguity between "cafe'"
and "caf*" (* now substituting for e with acute accent).  Is the *
really an e plus an ', or is it a separate character, or vice versa?
As noted above, the answer is different for French and Norwegian,
although the word is exactly the same!

The other problem with floating diacritics is that the number of
characters is not naturally bounded, a thought at which ISO
understandably shudders.  Unicode talks about bounding the displayable
number of characters (with diacritical marks) through extra-standard
means, while ISO wants to do it with intra-standard means.  For instance,
a commercial at-sign with acute accent and cedilla below doesn't make
much sense.  What should a Unicode display device do with that
sequence of characters?

I am deeply indebted to Professor David Birnbaum for explaining this
to me in much detail, and I'm of course responsible for any mistakes.

Hope this has helped.

--
[Erik Naggum]           Professional Programmer        <enag@ifi.uio.no>
Naggum Software             Electronic Text          <erik@naggum.uu.no>
0118 OSLO, NORWAY       Computer Communications            +47-2-836-863

hpa@casbah.acns.nwu.edu (H. Peter Anvin) (05/04/91)

In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
>Unicode allows a large number of floating diacritical marks in
>languages on which I don't have a shred of competence to comment,
>but several people have expressed the opinion that they're not really
>floating for several of those languages.

Yes, UNICODE does not care which language we are dealing with; note too
that one may have to combine characters from several sections of the
UNICODE in order to form a complete script.  The question then becomes: so
what?  If we insist on having diacritics that float for the languages where
that is possible and are fixed for the languages that require them, someday
someone will type "e'" with a fixed diacritic while writing Norwegian, or
with a floating one in French, just to have something break for them.  As I
understand it, UNICODE only has non-floating diacritics for historical
(compatibility) reasons.  For example, "e'" is U+00E9 only for compatibility
with Latin-1, while the explicit coding is U+0065 U+0301.  I take it that at
U+00E9 there will just be an alias entry referring to U+0065 U+0301.
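
For concreteness, the alias entry I have in mind would amount to a
little table like the following C sketch; the single entry is my own
example, and the real contents would of course have to come from the
standard itself:

    #include <stddef.h>

    /* Map a precomposed (compatibility) code to a base character plus
     * a non-spacing mark.  Only one illustrative entry is shown. */
    struct alias {
        unsigned int precomposed, base, mark;
    };

    static const struct alias aliases[] = {
        { 0x00E9, 0x0065, 0x0301 }   /* e with acute -> e + acute accent */
    };

    /* Return nonzero and fill in base/mark if code decomposes. */
    static int decompose(unsigned int code,
                         unsigned int *base, unsigned int *mark)
    {
        size_t i;

        for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++) {
            if (aliases[i].precomposed == code) {
                *base = aliases[i].base;
                *mark = aliases[i].mark;
                return 1;
            }
        }
        return 0;
    }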

>The other problem with floating diacritics is that the number of
>characters is not naturally bounded, a thought at which ISO
>understandably shudders.  Unicode talks about bounding the displayable
>number of characters (with diacritical marks) through extra-standard
>means, while ISO wants to do it with intra-standard means.  For instance,
>a commercial at-sign with acute accent and cedilla below doesn't make
>much sense.  What should a Unicode display device do with that
>sequence of characters?

In my opinion, it should take the @ sign and superimpose an acute accent
and tack a cedilla at the bottom.  A high-quality output device will
probably have a set of pre-finished combinations, but that doesn't prevent
it from using plain old superposition (or fancied-up superposition) as a
default solution.  After all, the combination tells it what it should look
like, right? 
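
In code, that fallback might look something like the sketch below; the
glyph primitives are invented names standing in for whatever the output
device actually provides:

    /* Stand-ins for the device's primitives (hypothetical names). */
    extern int  find_precomposed_glyph(unsigned int base,
                                       const unsigned int *marks,
                                       int nmarks);
    extern int  glyph_for(unsigned int base);
    extern void draw_glyph(int glyph);
    extern void overstrike_mark(unsigned int mark);

    /* Use a pre-finished glyph if the device has one, otherwise draw
     * the base character and overstrike each mark on top of it. */
    void render(unsigned int base, const unsigned int *marks, int nmarks)
    {
        int i;
        int g = find_precomposed_glyph(base, marks, nmarks);

        if (g >= 0) {
            draw_glyph(g);                /* dedicated, high-quality shape */
        } else {
            draw_glyph(glyph_for(base));  /* plain old superposition */
            for (i = 0; i < nmarks; i++)
                overstrike_mark(marks[i]);
        }
    }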

Endianism is a tricky question, but in most cases there is precedent.  For
telecommunication, both CCITT and Internet standards advocate bigendianism
(Motorola style).  Check out what the sequence of bits is out of a
V.24/RS-232 port.  Bigendian.  Thus that is probably the preferred style
for interchange.  For word processors etc. there are usually numeric
fields whose byte order has had to be resolved; mostly it follows the
style dominant on the machine the format was introduced on.
[P.S. As a programmer, I prefer littleendian (Intel) style; while a
bigendian hex dump is easier to read, littleendianism avoids many of the
problems with different variable sizes.   D.S.]

I also think there should be a recommended mangling scheme for converting
Unitext to the ASCII text spectrum (NOT octet spectrum) for purposes like
Internet mail, which is not very likely to change any time soon.  I have
given the question some thought, but I am not going to say anything until I
have figured out a "safe" way that could also distinguish between Unitext
and ASCII text.
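
Just to make the problem concrete -- this is emphatically not the
"safe" scheme, and the escape syntax is invented -- a naive mangling
could pass printable ASCII through and spell everything else out in
hex:

    #include <stdio.h>

    /* Naive mangling: printable ASCII passes through, everything else
     * becomes "#xxxx" (four hex digits).  Note that this does nothing
     * to distinguish mangled Unitext from ordinary ASCII text that
     * happens to contain '#', which is the hard part. */
    void mangle(const unsigned int *text, size_t len, FILE *out)
    {
        size_t i;

        for (i = 0; i < len; i++) {
            if (text[i] >= 0x20 && text[i] < 0x7F && text[i] != '#')
                putc((int) text[i], out);
            else
                fprintf(out, "#%04x", text[i]);
        }
    }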

                                     /Peter
-- 
IDENTITY:   Anvin, H. Peter           STATUS:    Student
INTERNET:   hpa@casbah.acns.nwu.edu   FIDONET:   1:115/989.4
HAM RADIO:  N9ITP, SM4TKN             RBBSNET:   8:970/101.4
EDITOR OF:  The Stillwaters BBS List  TEACHING:  Swedish

ck@voa3.VOA.GOV (Chris Kern) (05/05/91)

In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no
(Erik Naggum) writes:

>Consider the Norwegian and French words for a small restaurant,
>spelled "cafe'" (where the ' serves as a floating acute accent for
>rendering purposes in the absence of an international character set
>standard in which we wouldn't need it :-).  In Norwegian, the acute
>accent over e is optional; it's an ornament to indicate stress,
>toneme, etc.  It's not orthographically required.  In French, an e
>with acute is a different orthographic unit from plain, unadorned e.
>
>This means that in Norwegian, we can make do with a floating acute
>accent, since the function of the acute accent is to modify the
>character with which it is combined.  In French, however, they cannot
>make do with a floating acute accent because the acute accent does not
>have a function by itself.  Rather, the unit is "e with acute".

I confess that I don't understand the problem.  Regardless of the
attributes of the underlying language, is there some reason why I
should care whether a character-diacritic combination is stored as
one code or two as long as (a) its image is properly rendered when
I need to look at it and (b) a program which consumes a text stream
that includes such (character-diacritic) combinations can
unambiguously determine its content?  (Of course, if I can meet
requirement "b" presumably I can meet requirement "a" as well.)

-- 
Chris Kern     ck@voa3.voa.gov     ...uunet!voa3!ck     +1 202-619-2020

djbpitt@unix.cis.pitt.edu (David J Birnbaum) (05/05/91)

In article <1991May4.180549.29162@voa3.VOA.GOV> ck@voa3.VOA.GOV 
(Chris Kern) writes:

>I confess that I don't understand the problem.  Regardless of the
>attributes of the underlying language, is there some reason why I
>should care whether a character-diacritic combination is stored as
>one code or two as long as (a) its image is properly rendered when
>I need to look at it and (b) a program which consumes a text stream
>that includes such (character-diacritic) combinations can
>unambiguously determine its content?

Yes and no.  One could encode English logographically, but we don't
do it because (among other things) people don't process English text
logographically; they do it by character.  Similarly, we can encode
Hebrew consonant plus vertically aligned vowel points and
cantillation marks as single characters, but people don't work with
Hebrew text this way.

One practical consequence of encoding vowel+accentual_diacritic
variously is the way it affects natural classes.  I can search for
all words with long rising accents in Serbocroatian (graphically
an acute) more easily if the acute is a separate character.  I
would not want to conduct such a search in French, where letters
with acute do not constitute a natural class (i.e., where "acute"
does not have an independent meaning), but this is not an unnatural
type of search to make in Serbocroatian, comparable to searching for
all words with any other letter.

As another example, I can
strip the (orthographically optional) accents from a Serbocroatian
text more efficiently by searching and deleting the five accentual
diacritics than by searching and replacing each accented vowel by
its unaccented counterpart.  Again, this is not something one would
normally want to do for French.
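
In code, the decomposed representation makes that stripping pass
trivial; the sketch below is C, and the list of mark values is only a
placeholder, since I am not quoting the actual code positions of the
five accentual diacritics here:

    #include <stddef.h>

    /* Placeholder list of non-spacing accentual marks (not the real
     * code positions). */
    static int is_accentual_mark(unsigned int c)
    {
        static const unsigned int marks[] =
            { 0x0300, 0x0301, 0x0304, 0x030F, 0x0311 };
        size_t i;

        for (i = 0; i < sizeof marks / sizeof marks[0]; i++)
            if (c == marks[i])
                return 1;
        return 0;
    }

    /* Delete accentual marks in place; returns the new length.  The
     * base letters are untouched. */
    size_t strip_accents(unsigned int *text, size_t len)
    {
        size_t i, j = 0;

        for (i = 0; i < len; i++)
            if (!is_accentual_mark(text[i]))
                text[j++] = text[i];
        return j;
    }

With precomposed accented vowels the same job needs a substitution
table covering every accented vowel separately.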

One other issue is efficient use of character cells.  If there is
a small number of vowels and a small number of accent marks (to
use a common example and imprecise terminology), there isn't a lot
at stake.  But take a system with lots of vowel letters, lots of
accent marks, the possibility of multiple accent marks on a single
vowel, and you're talking about a lot of character cells if each 
one is to be treated as an indivisible unit.  And it is writing
systems where accent marks are productive units that are combined
ad hoc with a natural class of letters (such as vowels) that have
this large number of combinations.
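
A back-of-the-envelope calculation shows how quickly this grows; the
vowel and accent counts in this little C program are made up for
illustration, not taken from any particular writing system:

    #include <stdio.h>

    int main(void)
    {
        int vowels = 10, accents = 6;
        int one_mark  = vowels * accents;
        int pairs     = accents * (accents - 1) / 2;
        int two_marks = vowels * pairs;

        /* 10 + 60 + 150 = 220 cells, versus 10 + 6 = 16 codes */
        printf("precomposed cells needed: %d\n",
               vowels + one_mark + two_marks);
        printf("decomposed codes needed:  %d\n", vowels + accents);
        return 0;
    }

Allowing even two marks per vowel already pushes the precomposed
inventory past two hundred cells, against sixteen codes for the
decomposed approach.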

At a certain level, the answer to your question is that it doesn't
matter.  This seems to be why 10646 and Unicode have been able to
take opposing positions on the issue; both are concerned with form,
rather than function, and anything that arrives at the correct form
fulfills the minimal requirements.  But there are plenty of writing
systems that aren't like English or like French and where you can
only support the full inventory of complex combinations either by
storing the combination as a sequence or by dedicating an extremely
large number of character cells.  For orthographies like this, the
former is more efficient and corresponds more directly to the types
of operations that users may want to perform on the text.

--David

=======================================================================
Professor David J. Birnbaum         djbpitt@vms.cis.pitt.edu [Internet]
The Royal York Apartments, #802     djbpitt@pittvms.bitnet   [Bitnet]
3955 Bigelow Boulevard              voice: 1-412-687-4653
Pittsburgh, PA  15123  USA          fax:   1-412-624-9714

ccplumb@rose.waterloo.edu (Colin Plumb) (05/05/91)

I'm not a great linguist (English, French, and German), but I also like
separate accents because it's so much easier to accommodate weird uses.
Mathematicians put funny accents over and under every letter in creation.
Ever played with rho-hat?  Linguists and phoneticians may do the same.
And it's such a bother enumerating all the legal possibilities.
There's a CCITT standard, which I can't seem to locate right now, that
uses non-spacing accents, and it seems like the right thing to me.
Yes, e-acute is conceptually one thing in French, but qu and ph are
pretty distinct entities in English, and I can't say for sure how
different o-umlaut is from oe in German.  Mc and Mac have been
special-cased in many places in English (the correct all-caps spelling
of McDonald's is McDONALD'S), with superscript c's being common.

It's pretty impossible to come up with a character standard that
only lets you do sensible things.  All I can suggest is, don't do
the senseless ones.  Treat accented characters as double-byte characters
(recognizable by the first byte) if the accents are inseparable, but
don't if they can be logically separated.
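
A decoder for the double-byte treatment stays small; the sketch below
is C, and the accent range is invented (CCITT's T.61 does something in
this spirit, but these are not its actual values):

    #define ACCENT_FIRST 0xC0   /* invented range of accent prefix octets */
    #define ACCENT_LAST  0xCF

    /* Decode one element from an octet string: an accent-range octet
     * is a prefix and the following octet is the base letter; any
     * other octet stands alone.  Returns the octets consumed. */
    int next_char(const unsigned char *s,
                  unsigned int *base, unsigned int *accent)
    {
        if (s[0] >= ACCENT_FIRST && s[0] <= ACCENT_LAST) {
            *accent = s[0];
            *base   = s[1];
            return 2;
        }
        *accent = 0;
        *base   = s[0];
        return 1;
    }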

The CCITT standard also specifies a subset of the possible combinations
that are required to be displayable, but the usual cheap implementation
is probably accents plus some sort of character-height information,
while a higher-rent scheme uses some dedicated pairs, with fallback to the
former.  Separate accents make the low-cost scheme much easier,
without seriously hampering the higher-cost one.  Good typesetting systems
already handle ligatures and kerning as is.
-- 
	-Colin

kkim@plains.NoDak.edu (kyongsok kim) (05/05/91)

In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:

:Now, ISO DIS 10646 is of the "insist on all combinations" persuasion,
                                         ^^^^^^^^^^^^^^^^
Is it explicitly specified in any document or just an implicitly accepted
principle?
					 
:but has non-spacing characters for languages in which the "separate
:unit of information" is eminently the case (e.g. Hebrew).  I've come
:to learn that this is overly restrictive in many, many cases.
                              ^^^^^^^^^^^
Could you please explain this in more detail?  I am a little bit
confused.  Do you mean that "insisting on all combinations" is too
restrictive and therefore somewhat unreasonable for some languages?
Or something else?

--------------------------------------------

I will give one example showing the "all combinations" principle is not
applicable in at least one case.  In the case of Ancient Hangul, nobody knows
exactly what combinations of characters (i.e., syllables) were used in the
past, although component letters (or characters) of the syllables are
completely known.  Every time a scholar finds a new syllable, does he/she
have to report it to a national standards body, which will again report
it to ISO?  The scholar may not be able to represent and send that
character until the national standards body modifies its standard and
then ISO modifies 10646.  How long will it take?  If ISO simply drops the
"all combinations" principle (as with Hebrew), the whole problem can be
solved immediately.  (The solution is already known!)

I am still wondering whether Hebrew is the only script in 10646 not
honoring ISO's "all combinations" principle.  I tried to figure it out,
but no luck yet.  There seem to be several scripts, such as Devanagari
and other scripts used in India, Arabic and its several variants, Thai,
Lao, etc., which probably have similar properties.

:[Erik Naggum]           Professional Programmer        <enag@ifi.uio.no>
:Naggum Software             Electronic Text          <erik@naggum.uu.no>
:0118 OSLO, NORWAY       Computer Communications            +47-2-836-863

Kyongsok Kim
Dept. of Comp. Sci., North Dakota State University

e-mail: kkim@plains.nodak.edu; kkim@plains.bitnet; ...!uunet!plains!kkim

peter@ficc.ferranti.com (Peter da Silva) (05/07/91)

In article <ENAG.91May3200814@maud.ifi.uio.no> enag@ifi.uio.no (Erik Naggum) writes:
> a commercial at-sign with acute accent and cedilla below doesn't make
> much sense.  What should a Unicode display device do with that
> sequence of characters?

                        ,
It better display it as @ (more or less), because someone's gonna use it.
                        '
If you don't believe that, then consider the use of "!#%^&*|" in C.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"